Pattern Labs: LLMs for Data Annotation: How We Build Better Training Data at Scale

This post is part of Pattern Labs, our research initiative focused on translating real-world experimentation into defensible, production-ready workflows for mass tort litigation. Like our previous entry on synthetic medical records, this installment covers research I presented internally to the Pattern team. The topic is data annotation: specifically, how large language models can be used to generate the labeled training data that powers our purpose-built models. We invest in this work because the quality of that training data directly determines what our platform can do.

One of the less visible parts of building AI for legal work is everything that has to happen before a model can do anything useful. Before Pattern's platform can confirm a diagnosis, flag a product exposure, or surface a qualifying record, a model has to learn what those things look like. It learns from examples. Thousands of them. Someone, or something, has to label those examples first.

That process is called annotation. A few weeks ago I put together a session for the team on how large language models can take on a meaningful share of that work: what the approach looks like, where it performs well, and where it needs human oversight to hold up. The conversation that followed connected quickly to things we are already building. That is usually a sign we picked the right topic.

Here is what I covered, and why it matters for how we think about our platform.

Models need labeled data to learn

Training a machine learning model to do a specific task requires showing it the correct answer, repeatedly, until it learns the pattern. Feed the model a medical record page, tell it whether that page confirms a qualifying diagnosis, and let it adjust based on whether it got it right. Repeat that process across hundreds of thousands of pages, and the model starts to generalize.

The bottleneck is producing those correct answers at scale. Traditional annotation means hiring people to label data. That works, but it runs into problems quickly.

Cost and volume are the first constraint. Models need large amounts of labeled data to train well, and the expense of human annotation scales directly with volume. Quality is the second. When annotators are paid per completed task, speed tends to win over accuracy. Crowd annotation platforms have well-documented problems with annotators rushing through labels to maximize throughput rather than labeling carefully.

Domain expertise is the third constraint, and in legal work it is the most significant. Determining whether a medical record confirms a qualifying diagnosis requires genuine clinical knowledge. Without that background, labels are unreliable regardless of how many you produce.

What changed

Until a few years ago, there was no good shortcut. Most AI models were specialized. A model trained for one task generally could not be redirected to a different task without retraining or fine-tuning it on new labeled examples, which meant you always needed a fresh supply of human-labeled data to get started.

That changed with the emergence of general-purpose large language models. A landmark 2020 paper, Language Models Are Few-Shot Learners (Brown et al.), demonstrated that a sufficiently large model could perform well on tasks it had never been explicitly trained on, simply by being shown a few examples in the prompt. This established prompting as a serious technique and opened the door to a different approach to annotation entirely.

The core insight is that if you can describe a labeling task clearly and show a model a handful of examples, it can generate labels for new inputs at scale. What previously required a team of annotators can now be approximated by a well-constructed prompt.

How does LLM-driven annotation work in practice?

The process follows a consistent structure. You give the model a task description, a set of valid label options, a few demonstration examples showing correct answers, and the input you want labeled. The model outputs a label. You add it to your dataset and repeat across your full inventory.

To make this concrete: imagine you need to classify whether each page in a claimant's medical records confirms, contradicts, or is silent on a diagnosis. You write an instruction telling the model what to look for, show it a few labeled examples of each category, and run it across thousands of pages. The output is a structured, labeled dataset you can use to train a more specialized model.
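The prompt structure described above can be sketched in a few lines of Python. This is a minimal illustration, not Pattern's actual prompt: the `Demo` class, the `build_annotation_prompt` function, the label names, and the example page texts are all hypothetical and made up for this sketch.

```python
from dataclasses import dataclass

# Hypothetical label set, mirroring the three-way example in the text.
LABELS = ("confirms", "contradicts", "silent")


@dataclass(frozen=True)
class Demo:
    """One labeled demonstration page (illustrative structure only)."""
    page_text: str
    label: str


def build_annotation_prompt(task, labels, demos, page_text):
    """Assemble task description, label options, demonstrations, and the new input."""
    for demo in demos:
        if demo.label not in labels:
            raise ValueError(f"Demo uses unknown label: {demo.label!r}")
    parts = [task, "Valid labels: " + ", ".join(labels), ""]
    for demo in demos:
        parts += [f"Page: {demo.page_text}", f"Label: {demo.label}", ""]
    # The prompt ends at "Label:" so the model's next output is the label itself.
    parts += [f"Page: {page_text}", "Label:"]
    return "\n".join(parts)


# Invented demonstration pages, one per category.
demos = [
    Demo("Pathology report: invasive ductal carcinoma, left breast.", "confirms"),
    Demo("Biopsy negative for malignancy.", "contradicts"),
    Demo("Patient seen for seasonal allergies.", "silent"),
]
prompt = build_annotation_prompt(
    "Decide whether the page confirms, contradicts, or is silent on a breast cancer diagnosis.",
    LABELS,
    demos,
    "Mammogram follow-up scheduled; no findings recorded.",
)
print(prompt)
```

The same function would be called once per page, with only the final `page_text` argument changing, which is what makes the approach easy to run across a full inventory.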

(For more on how Pattern preserves clinical context across long records, see our work on improving document segmentation to preserve medical record context at scale.)

One practical challenge is that large language models generate free text by default, and free text does not aggregate cleanly at scale. If a model is supposed to output "confirmed," "not confirmed," or "insufficient information," you need it to output exactly those terms every time, not variations, explanations, or something else entirely.

The technical solution is called constrained decoding. Rather than allowing the model to select from its full vocabulary at each generation step, you restrict it to the tokens that can still lead to one of the valid outputs you have defined. Anything outside that set is blocked before it can be generated. This is also the mechanism behind structured JSON outputs more broadly: the same approach ensures a model returns data in a usable format rather than prose. The compute cost of applying this constraint is negligible compared to the inference itself, which makes it practical to use at the volumes litigation work requires.
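To show the idea behind constrained decoding, here is a toy sketch in Python. It assumes word-level tokens and a stand-in scoring function (`toy_scores`) in place of a real model; the names `END`, `LABEL_TOKENS`, `allowed_next_tokens`, and `constrained_greedy_decode` are invented for illustration. Production systems apply the same masking to model logits over a real tokenizer's vocabulary.

```python
END = "<end>"  # hypothetical end-of-sequence token

# Valid outputs, pre-split into toy word-level tokens (real tokenizers differ).
LABEL_TOKENS = [
    ("confirmed",),
    ("not", "confirmed"),
    ("insufficient", "information"),
]


def allowed_next_tokens(prefix, label_tokens):
    """Tokens that keep `prefix` on a path to a valid label; END once a label is complete."""
    allowed = set()
    for seq in label_tokens:
        if seq[:len(prefix)] == prefix:
            allowed.add(seq[len(prefix)] if len(prefix) < len(seq) else END)
    return allowed


def constrained_greedy_decode(score_fn, label_tokens):
    """Greedy decoding where only tokens that continue a valid label are eligible."""
    prefix = ()
    while True:
        scores = score_fn(prefix)
        allowed = allowed_next_tokens(prefix, label_tokens)
        # Choosing only among allowed tokens is equivalent to masking the rest to -inf.
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        if best == END:
            return " ".join(prefix)
        prefix += (best,)


def toy_scores(prefix):
    """Stand-in for a model: left unconstrained, it would begin with free text ("The")."""
    return {
        "The": 3.0, "probably": 2.5, "not": 2.0, "confirmed": 1.5,
        "insufficient": 0.5, "information": 0.1, END: 0.0,
    }


unconstrained_first = max(toy_scores(()), key=toy_scores(()).get)
constrained_label = constrained_greedy_decode(toy_scores, LABEL_TOKENS)
print(unconstrained_first, "|", constrained_label)
```

In this toy setup the unconstrained first token is "The", while the constrained decoder can only ever return one of the three defined labels.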

How does LLM annotation quality compare to human annotation?

Research comparing LLM annotations to crowd worker annotations across multiple benchmark datasets found that ChatGPT accuracy was roughly on par with the median human annotator, and in several tasks meaningfully higher. That result is worth sitting with. In many cases, an LLM annotating data is not trading quality for speed. It is delivering comparable quality at a fraction of the cost and time.

That said, LLM annotation has real failure modes. Models perform worse on tasks or label schemas that deviate from what they were trained on. For example, a model tends to do worse on a sentiment classification task with four gradations than on a simple two-way split, even though the underlying task is not harder. Human annotators are less sensitive to that kind of variation. Hallucinations and systematic biases are also genuine risks, not hypothetical ones.

The right model: Human and LLM together

The approach that holds up best in practice combines both. LLMs generate a large candidate pool of labeled examples efficiently. Human reviewers validate, filter, and correct. You can also use the model's own confidence as a signal, keeping only the labels it is most certain about and discarding the rest, then combining that filtered set with a smaller amount of high-quality human annotation.
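A minimal sketch of that curation step might look like the following. It assumes each LLM label carries a confidence score (for instance, the probability the model assigned to its chosen label); the `LlmLabel` class, the `curate` function, the threshold value, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmLabel:
    page_id: str
    label: str
    confidence: float  # assumed: probability the model gave its chosen label


def curate(llm_labels, human_labels, threshold):
    """Keep confident LLM labels, set the rest aside, and let human labels override."""
    dataset = {
        item.page_id: item.label
        for item in llm_labels
        if item.confidence >= threshold
    }
    # Low-confidence pages can be discarded or queued for human review.
    below_threshold = [
        item.page_id for item in llm_labels if item.confidence < threshold
    ]
    dataset.update(human_labels)  # human annotation wins on any overlap
    return dataset, below_threshold


# Invented sample records for illustration.
llm_labels = [
    LlmLabel("p1", "confirmed", 0.98),
    LlmLabel("p2", "not confirmed", 0.61),
    LlmLabel("p3", "confirmed", 0.95),
]
human_labels = {"p3": "insufficient information", "p4": "not confirmed"}

dataset, below_threshold = curate(llm_labels, human_labels, threshold=0.9)
print(dataset)
print(below_threshold)
```

Where to set the threshold is a judgment call: a higher value keeps fewer but cleaner LLM labels and shifts more of the load onto human annotation.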

Even if an LLM is labeling at 90% accuracy, selective curation combined with targeted human review can push effective training data quality considerably higher than either approach alone.

Why this connects to what we build

We are currently applying this to training a document ranking model that supports litigation development in Pattern. The goal is straightforward: given a query like "find pages documenting a breast cancer diagnosis," return the most relevant pages from a claimant's records, ranked by relevance.

We have substantial case data to work with, which is one of the advantages of operating at Pattern's scale. The gap in our existing training set was query diversity. Our labeled examples skewed toward structured field lookups rather than the kind of open-ended clinical questions a reviewer might actually ask. LLMs can generate multiple distinct queries per document, expanding the training dataset without requiring proportionally more source material. The result is a ranker that generalizes better across the kinds of searches that actually happen in practice.
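As a rough sketch of how query expansion can feed a ranker's training set, the snippet below pairs each page with several generated queries and drops near-duplicates. The `expand_training_pairs` function and the `stub_generator` stand-in are assumptions for illustration; in practice the generator would be a call to an LLM, and the pipeline we run is more involved than this.

```python
def expand_training_pairs(pages, generate_queries, queries_per_page=3):
    """Build (query, page_id) positive pairs, skipping duplicate queries per page."""
    pairs = []
    for page_id, text in pages.items():
        seen = set()
        for query in generate_queries(text, queries_per_page):
            # Normalize case and whitespace so trivial variants count as duplicates.
            normalized = " ".join(query.lower().split())
            if normalized and normalized not in seen:
                seen.add(normalized)
                pairs.append((query.strip(), page_id))
    return pairs


def stub_generator(page_text, n):
    """Stand-in for an LLM call; returns canned queries, including one near-duplicate."""
    candidates = [
        "Find pages documenting a breast cancer diagnosis",
        "find pages documenting a  breast cancer diagnosis",
        "Which page records the biopsy result?",
    ]
    return candidates[:n]


# Invented page texts for illustration.
pages = {
    "page-001": "Pathology report: invasive ductal carcinoma, left breast.",
    "page-002": "Oncology consult note following biopsy.",
}
pairs = expand_training_pairs(pages, stub_generator)
for query, page_id in pairs:
    print(page_id, "<-", query)
```

Each source page yields several training pairs, which is how the dataset grows without requiring proportionally more source material.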

(For a related look at how we use evolutionary search to refine prompts themselves, see Pattern Labs: Beyond manual tuning with Genetic Prompt Optimization.)

There is also a reasonable question worth addressing directly: if a large language model can annotate a task well enough to train another model, why not skip that step and just use the LLM for the end task?

A few reasons. Task-specific models consistently outperform general-purpose LLMs on the specific task they are trained for. At Pattern, our page labeling and document segmentation models are purpose-built for exactly this reason. They are significantly smaller, faster, and cheaper to run at litigation scale. They do not depend on external API access, which matters when case data cannot leave a controlled environment. And unlike with proprietary models, you have full visibility into what you have built and full control over how it behaves.

Large language models are a powerful tool for building training data. The purpose-built models trained on that data are what get deployed in Pattern's platform.

The team at Pattern spends real time on this research, debates the tradeoffs, and connects academic work to what we are building. Questions came up in our session about confidence thresholds for curation, about what it actually means for an LLM to be uncertain about a label, and about when human review changes the outcome versus just confirming it. That kind of conversation is how we make sure the systems we build are ones firms can actually trust.

Raj Patel is a machine learning engineer at Pattern Data.

Frequently asked questions

What is data annotation in machine learning?

Data annotation is the process of labeling examples so that a machine learning model can learn from them. For legal AI, this often means tagging medical record pages with classifications like "confirms diagnosis," "contradicts diagnosis," or "silent" so a model can learn to make those calls on new records. The annotated dataset is what the model trains on, and its quality directly determines how well the trained model performs.

Can large language models replace human annotators?

Not entirely. LLMs can generate large pools of labeled training data quickly and at low cost, and recent benchmarks show their accuracy is comparable to crowd-sourced human annotators on many tasks. But they fail in characteristic ways on fine-grained label schemas, novel tasks, and edge cases. The strongest approach combines LLM-generated labels with targeted human validation, using model confidence to flag which examples need review.

How do mass tort firms benefit from LLM-trained models?

Mass tort dockets contain millions of medical record pages across thousands of claimants. General-purpose LLMs are too slow and too expensive to apply directly at that scale, and case data often cannot leave a controlled environment. Purpose-built models trained on LLM-generated annotations are smaller, cheaper to run, and deployable inside firm-controlled infrastructure. They are what makes inventory-level review and reporting possible across an entire docket.

What is constrained decoding and why does it matter for legal AI?

Constrained decoding restricts a language model's outputs to a predefined set of valid responses, such as "confirmed," "not confirmed," or "insufficient information." Without it, models generate free text that does not aggregate cleanly across thousands of cases. With it, outputs are structured and reliable enough to feed into downstream systems like docket-level scoring and reporting. It is also the mechanism behind structured JSON outputs from modern LLMs.