How many documents can be included in an Active Learning saved search?

Discover why a cap of 9 million documents is recommended for Active Learning saved searches. The limit balances processing speed, training quality, and retrieval relevance, keeping computation manageable while preserving insight. It's a reminder that in data-driven learning, quality often beats quantity.

Outline:

  • Opening idea: Active Learning in Relativity helps you train smarter, not bigger.

  • What the saved search cap means: 9 million documents is the sweet spot.

  • Why bigger isn’t always better: processing, training time, and relevance matter.

  • How the cap keeps things efficient and accurate.

  • Practical tips to work within the limit without losing signal.

  • Real-world flavor: analogies, quick examples, and a few caveats.

  • Quick recap and takeaways.

Active Learning doesn’t have to be a mystery. It’s the part of Relativity where a model learns from your feedback and starts predicting what’s likely relevant. Think of it as a tutor that gets better the more you show it the kind of documents you care about. When you’re setting up a saved search for Active Learning, there’s a practical ceiling you’ll want to keep in mind: no more than 9 million documents. Yes, really. This number isn’t a magic trick; it’s a thoughtful balance between learning breadth and the system’s ability to process, train, and retrieve efficiently. Let me explain why this cap exists and how you can work with it.

What does 9 million actually mean in practice?

Saved searches in Relativity aren’t just about collecting documents; they’re about feeding a learning loop. The model trains on the set you provide, then it tests and updates its understanding based on new feedback. If you throw too many documents into the mix, you risk slowing down the entire loop. Training cycles can stretch longer, indexing can become heavier, and the time between feedback and useful results grows. In short, too-large datasets can dampen the model’s responsiveness and, paradoxically, reduce accuracy in practice.
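
To make that loop concrete, here is a toy, hypothetical sketch of the cycle in Python, built on scikit-learn and synthetic data rather than anything from Relativity's actual internals or APIs. It trains on a small labeled seed set, scores the unlabeled pool, routes the most uncertain documents to "review," and retrains. Notice that the scoring step touches every unlabeled document each round; that per-round cost is exactly what balloons when the saved search gets too large.

```python
# Toy active-learning loop: train on labeled seeds, score the unlabeled
# pool, route the most uncertain documents to "review", retrain. This is
# a conceptual sketch, not Relativity's implementation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a document population: X holds feature vectors,
# y holds the relevance labels a reviewer would supply on request.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

labeled = list(rng.choice(len(X), size=50, replace=False))  # seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for round_num in range(5):
    model.fit(X[labeled], y[labeled])          # learn from feedback so far
    # Score every unlabeled document -- this step's cost grows with the
    # pool, which is one reason oversized saved searches slow the loop.
    probs = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = np.abs(probs - 0.5)          # 0.5 = model is least sure
    batch = np.argsort(uncertainty)[:20]       # next documents to review
    for idx in sorted(batch, reverse=True):
        labeled.append(unlabeled.pop(idx))     # reviewer supplies the label
    print(f"round {round_num}: {len(labeled)} labeled, "
          f"accuracy {model.score(X, y):.3f}")
```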

If you’re picturing a scale, think of it this way: you want enough material to teach the model what “relevant” looks like, but you don’t want to drown the system in noise, redundancy, and edge cases that won’t help with generalization. Nine million is a practical boundary that keeps training nimble while still offering a rich landscape for the model to learn from. The idea isn’t “more is always better.” It’s “more, but with intention.”

Why not go bigger? The tradeoffs explained

Larger datasets do carry more information, sure. But the returns tend to taper off after a certain point. Here’s what happens as the number climbs well beyond 9 million:

  • Diminishing returns: The additional documents often add repetitive content or noise rather than new, useful signal. The model spends energy distinguishing duplicates rather than learning useful distinctions.

  • Slower iterations: Training rounds extend, and each round can delay next steps in the workflow. If you’re chasing faster turns, that delay matters.

  • Memory and compute pressure: Bigger sets demand more from RAM, GPUs, and storage. That can force compromises elsewhere—like concurrency limits or other tasks waiting in the queue.

  • Retrieval friction: The more you train on, the more time it can take to retrieve the next batch for review or tagging. Speed matters when reviewers need to stay in a steady rhythm.

All of this can erode the very goal Active Learning is designed to support: delivering relevant results quickly and improving accuracy with focused feedback.

Quality over quantity, again and again

The cap encourages you to aim for a high signal-to-noise ratio. If you have 9 million well-curated documents, you’re likely to see clearer patterns and more meaningful progress in your model’s predictions. If you have 25 million with a lot of noise, you’ll be polishing a murky mirror instead of looking through a clear lens.

This is where thoughtful curation pays off. The goal is to expose the model to representative, informative examples rather than everything under the sun. You’ll still cover the diversity of custodians, languages, and document types, but you do it with an eye toward utility, not sheer volume.

Small, smart moves that keep you within the limit

You don’t have to feel boxed in by a number. Here are practical strategies to stay within the 9 million ceiling while keeping your Active Learning effective:

  • Start with a core, high-signal seed set: Choose documents you’re confident are relevant and representative. This gives the model a solid starting point.

  • Use stratified sampling: If you’re working across multiple custodians or file types, sample proportionally from each group. This helps the model learn across the spectrum without exploding the total (a sketch of this, together with deduplication, follows this list).

  • Remove duplicates and near-duplicates: Reducing redundancy trims the fat and helps the model focus on meaningful differences.

  • Pre-filter with targeted criteria: Apply filters based on keywords, dates, or custodians to prune obvious noise before feeding the set into Active Learning.

  • Clustering for representativeness: A quick clustering pass can help you select a handful of documents from each cluster. You train on the cluster representatives, not every single item.

  • Iterative refinement: After a round of feedback, trim low-value additions and keep the next batch lean. You’re continually sharpening the signal.

  • Monitor the learning curve: Track metrics like precision, recall, and the model’s confidence scores. If gains plateau, it’s a cue to adjust the sample strategy rather than push more data in (see the toy plateau check after this list).
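
As a rough illustration of the sampling and deduplication tips above, here is a small Python sketch. Everything in it is an assumption for illustration: the `custodian` and `text` fields, the hash-based near-duplicate check, and the cap value are stand-ins, not Relativity fields or APIs.

```python
# Stratified sampling plus cheap near-duplicate trimming. The "custodian"
# and "text" fields and the cap value are illustrative assumptions, not
# Relativity fields or APIs.
import hashlib
import random
from collections import defaultdict

CAP = 9_000_000  # the recommended saved-search ceiling

def dedupe(docs):
    """Keep one document per normalized-text hash (a crude near-dup proxy)."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha1(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def stratified_sample(docs, cap=CAP, seed=0):
    """Sample proportionally per custodian so no group dominates the set."""
    if len(docs) <= cap:
        return docs
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["custodian"]].append(doc)
    frac = cap / len(docs)
    sample = []
    for group in groups.values():
        k = max(1, round(len(group) * frac))  # keep every custodian visible
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample  # rounding may land slightly over cap; trim if needed

docs = [
    {"custodian": "smith", "text": "Quarterly forecast attached"},
    {"custodian": "smith", "text": "quarterly  forecast  attached"},  # near dup
    {"custodian": "jones", "text": "Lunch on Friday?"},
]
print(len(stratified_sample(dedupe(docs), cap=2)), "documents kept")
```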
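And for the learning-curve tip, a toy plateau check: if precision (or recall) has barely moved over the last few rounds, that is the cue to rework the sampling strategy rather than pour in more documents. The metric history, window, and threshold here are invented for illustration.

```python
# Toy plateau check for the learning curve. Metric history, window, and
# threshold are invented for illustration.
def plateaued(history, window=3, min_gain=0.01):
    """True if the metric improved by less than min_gain over `window` rounds."""
    if len(history) <= window:
        return False
    return history[-1] - history[-1 - window] < min_gain

precision_by_round = [0.42, 0.55, 0.63, 0.67, 0.675, 0.677]
if plateaued(precision_by_round):
    print("Gains have flattened: rework the sampling strategy, "
          "not the document count.")
```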

A few analogies to keep it grounded

  • Training a model with Active Learning is a lot like teaching a class with a focused curriculum. You don’t hand every student every possible case; you give representative examples, correct missteps, and refine as you go.

  • Think of the 9 million cap as the size of a well-curated bookshelf. It’s not about cramming every title; it’s about having a shelf that’s rich, varied, and easy to navigate.

  • Consider a chef’s mise en place: you assemble the right ingredients, prepare them well, and keep the kitchen moving. Bigger pantry shelves don’t automatically guarantee tastier results; organization and quality do.

What this means for Relativity users

Relativity’s ecosystem is built around efficiency and precision. Active Learning leverages feedback to improve classification and search relevance, and saved searches are the nerve center of that loop. The 9 million guideline is a practical reminder that the system has to balance training dynamics with timely results. When you respect the cap, you’re more likely to see quicker convergence on relevant documents and a healthier, more predictable feedback cycle.

If you ever feel pulled to push beyond the limit, pause and reassess. Are you increasing signal, or are you just adding more noise? Are you maintaining diversity without spiraling into redundancy? These questions help you recalibrate without losing momentum.

A gentle word on real-world limits and governance

Every environment has its own constraints—hardware, licensing, and workflow SLAs all matter. The 9 million recommendation gives teams a common frame of reference, but you’ll still need to adapt to your setup. If you’re working with unusually large data sources or strict turnaround times, you might experiment with staged training where you alternate between batches and focused sampling. The core idea remains: keep the dataset large enough to inform, but not so large that the learning loop slows to a crawl.

Key takeaways

  • The recommended maximum of 9 million documents balances learning quality with processing efficiency.

  • Bigger isn’t automatically better. When you scale beyond the cap, returns can dwindle while costs rise.

  • Focus on signal-rich, representative data: curate, deduplicate, and sample smartly.

  • Use iterative cycles and monitor learning progress to stay aligned with your goals.

  • Leverage Relativity features—saved searches, analytics, and TAR-style feedback loops—to maintain a steady rhythm and clear visibility into progress.

If you’re exploring Active Learning in this context, remember that the goal isn’t to amass a giant heap of documents. It’s to cultivate a dataset that teaches the model what matters, quickly and reliably. The 9 million guideline is a practical compass—one that helps you navigate the balance between depth of insight and efficiency. With thoughtful curation and a steady, measured approach, you’ll keep your learning loop nimble, your results sharp, and your workflows smooth.

In the end, it’s about clarity and focus. A well-chosen set of up to nine million documents can deliver meaningful improvements without bogging down the system. And that, honestly, is the sweet spot many teams discover after a few cycles of trial and adjustment. If you’re curious to see how this plays out in your own projects, start small, stay intentional, and let the learning unfold. You might be surprised by how quickly the quality of predictions catches up with your expectations—and perhaps even exceeds them.
