This post is part of Pattern Labs, our research initiative focused on translating real-world experimentation into defensible, production-ready workflows for mass tort litigation.
Mass tort workflows run on instructions. Every extraction, confirmation of product exposure, identification of a diagnosis date, and verification of litigation criteria is driven by the prompts that tell the system what to look for and how to respond.
At a smaller scale, manual prompt engineering works well enough. The team writes an instruction, checks a few outputs, and adjusts until the results look right. The problem is that manual prompts are brittle: when applied to a large, diverse docket, performance degrades and inconsistencies compound as cases move through the evaluation process.
In our latest Pattern Labs session, I walked through how automated prompt optimization addresses this problem and why Pattern is now positioned to implement it.
Prompt engineering is a skill. It requires intuition and experimentation, and experienced practitioners are meaningfully better at it than those just getting started. The problem is that even well-crafted prompts are brittle. What works in controlled conditions can break down when exposed to the volume and variability of a real docket.
If someone writes "I thought the book was chilling" in a horror novel review, a system using a basic prompt might flag the sentiment as negative because it associates "chilling" with discomfort. Without context, the model misses that it's actually a compliment. In a high-stakes legal environment, that kind of misread has real consequences.
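To make the "chilling" example concrete, here is an illustrative sketch of how added genre context changes the instruction itself. Both prompt templates are hypothetical, not Pattern's actual prompts:

```python
# Illustrative only: two hypothetical prompt templates for sentiment
# classification. The second supplies the genre context a basic prompt lacks.

BASIC_PROMPT = (
    "Classify the sentiment of this review as positive or negative:\n"
    "{review}"
)

CONTEXTUAL_PROMPT = (
    "You are rating a review of a {genre} book. Words describing the genre "
    "working as intended (e.g. 'chilling' for horror) signal praise.\n"
    "Classify the sentiment of this review as positive or negative:\n"
    "{review}"
)

review = "I thought the book was chilling."
print(CONTEXTUAL_PROMPT.format(genre="horror", review=review))
```

With the contextual template, the model is told up front how genre-specific vocabulary should be read, rather than being left to guess.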
Prompt optimization replaces guess-and-check editing with measurement. Instead of a human editor hunting for the perfect wording, optimization algorithms refine instructions iteratively against defined performance metrics.
Gradient-free methods such as MIPRO and GEPA are particularly valuable in legal tech because they treat the model as a black box: they do not require access to its internal weights, which are often proprietary. GEPA in particular takes a genetic, evolutionary approach.
Genetic optimization works like natural selection for instructions: a population of candidate prompts is generated from a seed, each candidate is scored against representative cases, the strongest variants are kept and mutated or recombined, and the cycle repeats until performance converges.
Figure 1: The genetic optimization loop, from seed prompt initialization to optimized output.
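The loop in Figure 1 can be sketched in a few lines. The mutation operator and fitness function below are toy stand-ins: in practice, variants are generated by a model and scored by an automated grader, but the selection-and-variation structure is the same.

```python
import random

SEED = "Extract the diagnosis date from the record."

def mutate(prompt: str) -> str:
    """Toy mutation: append one of a few instruction tweaks."""
    tweaks = [
        " Respond with an ISO date.",
        " If no date is present, answer 'none'.",
        " Quote the supporting sentence.",
    ]
    return prompt + random.choice(tweaks)

def fitness(prompt: str) -> float:
    """Toy fitness: rewards longer, more specific instructions."""
    return float(len(prompt))

def optimize(seed: str, population_size: int = 8, generations: int = 5) -> str:
    # Initialization: seed prompt plus mutated variants.
    population = [seed] + [mutate(seed) for _ in range(population_size - 1)]
    for _ in range(generations):
        # Selection: keep the top half by fitness.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Variation: refill the population by mutating survivors.
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(population_size - len(survivors))
        ]
    return max(population, key=fitness)

best = optimize(SEED)
print(best)
```

Swapping the toy `fitness` for a real grader score turns this sketch into the optimization loop the post describes.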
In practice, Pattern workflows are not single instructions. Case evaluation moves through multiple connected stages: early case screening, record development and gap detection, valuation modeling, and settlement packet generation. Each stage involves its own prompts. Errors at one stage can compound downstream.
Prompt optimization strengthens each step independently and reduces cumulative variance across the full pipeline. Stability must be maintained across all stages, not just isolated tasks.
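To see why cumulative variance matters, consider a hypothetical pipeline in which each stage is independently 95% accurate (the figure is illustrative, not a measured Pattern metric). End-to-end accuracy compounds multiplicatively:

```python
# Illustrative arithmetic: four pipeline stages, each assumed 95% accurate.
stages = ["screen", "gap-detection", "valuation", "settlement-packet"]
per_stage_accuracy = 0.95  # assumed, for illustration only
end_to_end = per_stage_accuracy ** len(stages)
print(f"{end_to_end:.3f}")  # → 0.815
```

Even modest per-stage error rates leave nearly one case in five affected somewhere in the pipeline, which is why each stage must be strengthened individually.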
The biggest hurdle to prompt optimization has always been the fitness function: the ability to objectively score whether one prompt is better than another.
Historically, this required massive, manually labeled datasets. With the development of Pattern's internal Grader, however, we now have the infrastructure to optimize at scale. The Grader acts as our automated judge, evaluating extractions for medical accuracy, alignment with litigation criteria, and structural integrity to give every response a numeric score.
By using the Grader as a fitness function, we can run hundreds of rollouts to systematically find the prompts that minimize errors across the docket.
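A minimal sketch of that idea, with `grade` as a hypothetical stand-in for Pattern's Grader: here it simply checks which required fields a response filled in, and a prompt's fitness is its average score across many rollouts.

```python
import statistics

REQUIRED_FIELDS = ["diagnosis", "diagnosis_date", "product_exposure"]

def grade(response: dict) -> float:
    """Stand-in grader: fraction of required fields present, in [0, 1]."""
    present = sum(1 for f in REQUIRED_FIELDS if response.get(f))
    return present / len(REQUIRED_FIELDS)

def prompt_fitness(prompt: str, cases: list, run) -> float:
    """Average grader score across one rollout per case."""
    scores = [grade(run(prompt, case)) for case in cases]
    return statistics.mean(scores)

# Usage with a fake `run` that just echoes pre-built responses:
fake_run = lambda prompt, case: case
cases = [
    {"diagnosis": "X", "diagnosis_date": "2021-03-01", "product_exposure": "Y"},
    {"diagnosis": "X", "diagnosis_date": None, "product_exposure": "Y"},
]
print(round(prompt_fitness("Extract the fields.", cases, fake_run), 2))  # → 0.83
```

Averaging over many cases is what lets the optimizer distinguish prompts that happen to work on a few examples from prompts that hold up across the docket.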
Automated optimization strengthens the technical performance of each stage in the case evaluation process.
The goal is stability: prompts that hold up across a large, diverse inventory without manual intervention every time criteria shift or new record types appear.
Through our research into frameworks like DSPy, we're exploring a move toward what its developers call programming, not prompting: building the system's logic as structured, testable programs rather than hand-tuned instructions. For the litigation lifecycle, that means prompt quality becomes something you build and measure, not something left to guesswork.
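As a rough illustration of "programming, not prompting," a task can be declared as a typed, testable unit rather than a hand-tuned instruction string. The names below are illustrative, not Pattern's or DSPy's actual API:

```python
import re
from dataclasses import dataclass

# Sketch only: the extraction task is declared as a typed structure with a
# structural check, so its quality can be measured rather than eyeballed.

@dataclass
class DiagnosisExtraction:
    diagnosis: str
    diagnosis_date: str  # ISO date, or "none"

def validate(result: DiagnosisExtraction) -> bool:
    """Structural check that can gate the pipeline and drive optimization."""
    return bool(result.diagnosis) and (
        result.diagnosis_date == "none"
        or re.fullmatch(r"\d{4}-\d{2}-\d{2}", result.diagnosis_date) is not None
    )

print(validate(DiagnosisExtraction("mesothelioma", "2021-03-01")))  # → True
print(validate(DiagnosisExtraction("mesothelioma", "March 2021")))  # → False
```

Because the contract is explicit, a failed check becomes a measurable signal an optimizer can act on, instead of an error someone notices during manual review.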
What is a prompt in the context of Pattern's platform? A prompt is the set of instructions that tells the system what to look for and how to structure its response. In our workflows, prompts guide tasks such as extracting medical details, identifying proof of product use, checking alignment with litigation criteria, and producing structured outputs for scoring and settlement preparation.
Why does prompt quality matter in mass tort litigation? Mass tort workflows operate at scale. When thousands of records are being processed, small inconsistencies in instructions can lead to missed data, incorrect extractions, or unnecessary manual review. Clear and structured prompts improve accuracy and consistency across the entire docket.
What is prompt optimization? Prompt optimization is the process of systematically improving the instructions given to the system. Instead of manually rewriting prompts and testing a few examples, optimization evaluates multiple versions against representative case data, measures performance using defined criteria, and keeps the versions that perform better. The result is more stable and reliable outputs.
How is this different from manual prompt engineering? Manual prompt engineering relies on intuition and small-scale testing. It can produce improvements, but performance may degrade when applied across large datasets. Prompt optimization introduces measurement, using structured evaluation to compare variations and refine instructions based on performance data rather than guesswork.
Does prompt optimization require retraining the model? No. Prompt optimization operates at the instruction layer. It improves how tasks are defined without modifying the underlying model. This makes it practical for production systems where model weights are not accessible.
What role does evaluation play in optimization? Optimization depends on a reliable scoring mechanism. In our workflows, outputs are evaluated for correctness, completeness, alignment with litigation criteria, and structural consistency. This allows us to measure improvement objectively and reduce regression risk as criteria evolve.
How does this impact the Screen → Develop → Settle lifecycle? Improvements at the instruction level reduce manual correction and improve case readiness across all stages. Screen benefits from more consistent early case scoring and eligibility signals. Develop sees stronger gap detection and better alignment with evolving MDL criteria. Settle gains confidence in the structured outputs used for valuation and settlement packet generation.