Drip · Engineering Practice · 14 min read

Eval-Driven Development

Frontier models saturated the old benchmarks. Production teams replaced them with a four-stage CI pipeline that treats prompts as code and catches regressions before they ship: the new gold standard for working with LLMs.

The bottom line. Milind Nair’s framework (March 2026, drawing on Adaline + Braintrust production usage) replaced static benchmarks with a four-stage continuous eval pipeline: local dev for fast feedback, PR check to block regressions, deploy gate as the pre-prod quality bar, and production monitor for drift detection. The eval set is sourced from real production failures, not synthetic data. LLM judges have measurable biases (verbosity, position) that need mechanical fixes. Cross-model disagreement becomes a bonus signal: where two frontier models disagree on the same input, you have your next golden-set case for free.

§ 00 · SHIP PROMPTS LIKE YOU SHIP CODE
Prompts are code. Test them like code.

For most of 2023 and 2024, evaluating an LLM application meant running a benchmark — MMLU, BBH, MT-Bench, HumanEval — and eyeballing the number. By early 2026 that pattern was broken in two ways. Frontier models had saturated the benchmarks; new models scored the same as old models on every standard suite. And the benchmarks measured the model, not your application — a 92% on MMLU told you nothing about whether your support agent was getting better or worse.

Production teams replaced the benchmarks with a CI pipeline. The unit of evaluation is your prompt + your tools + your wiring, not the model in isolation. The eval runs in CI on every change. A regression on the golden set blocks the merge, the same way a broken test does. The discipline is borrowed wholesale from software engineering and applied to the layer above the model.
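A minimal sketch of what "a regression on the golden set blocks the merge" can look like in code. The names `EvalResult`, `gate`, and `max_drop` are hypothetical and not taken from Nair's framework; the sketch assumes each golden-set case reduces to a pass/fail result and that CI treats a nonzero exit code as a failed check.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    passed: bool


def pass_rate(results: Sequence[EvalResult]) -> float:
    """Fraction of golden-set cases that passed."""
    if not results:
        raise ValueError("golden set produced no results")
    return sum(r.passed for r in results) / len(results)


def newly_failing(baseline: Sequence[EvalResult],
                  candidate: Sequence[EvalResult]) -> List[str]:
    """IDs of cases that passed on the baseline but fail on the candidate."""
    passed_before = {r.case_id for r in baseline if r.passed}
    return [r.case_id for r in candidate
            if not r.passed and r.case_id in passed_before]


def gate(baseline: Sequence[EvalResult],
         candidate: Sequence[EvalResult],
         max_drop: float = 0.0) -> int:
    """Return a process exit code: 0 allows the merge, 1 blocks it.

    max_drop is an assumed tolerance for judge noise; the default of 0.0
    blocks on any drop in pass rate.
    """
    drop = pass_rate(baseline) - pass_rate(candidate)
    return 1 if drop > max_drop else 0
```

In a CI job the script would end with something like `sys.exit(gate(baseline, candidate))`, and `newly_failing` gives the reviewer the list of cases to look at.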

§ 01 · THE FOUR-STAGE PIPELINE
Fast feedback, then quality gates, then production drift

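Nair’s framing breaks the pipeline into four stages, each with a different speed/coverage tradeoff: local dev for fast feedback, PR check to block regressions, deploy gate as the pre-prod quality bar, and production monitor for drift detection.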

The four stages aren’t four different evals — they’re the same eval logic running at four different cost/coverage points. Build the eval once. Run it everywhere. Calibrate the row count per stage.
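One way to express "build the eval once, calibrate the row count per stage" is a small stage table that all feeds the same evaluation function. This is a sketch under assumptions: `Stage`, `STAGES`, and `run_stage` are hypothetical names, the row counts are the ones used in the lab in § 06, and the production monitor is left out because it samples live traffic rather than a fixed set.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Case = TypeVar("Case")


@dataclass(frozen=True)
class Stage:
    name: str
    max_rows: Optional[int]  # None means run every available row
    blocking: bool           # whether a failure stops the change


# Row counts follow the lab in section 06; treat them as starting points.
STAGES = (
    Stage("local_dev", 100, blocking=False),    # fast feedback only
    Stage("pr_check", 500, blocking=True),
    Stage("deploy_gate", None, blocking=True),  # full audit set
)


def run_stage(stage: Stage,
              cases: Sequence[Case],
              evaluate: Callable[[Case], bool]) -> Tuple[float, int]:
    """Run the shared eval logic on this stage's slice of the cases.

    Returns (pass rate, number of rows evaluated).
    """
    rows = cases if stage.max_rows is None else cases[:stage.max_rows]
    if not rows:
        raise ValueError("no cases to evaluate")
    passed = sum(1 for case in rows if evaluate(case))
    return passed / len(rows), len(rows)
```

Taking the first N rows is the simplest possible slicing; a stratified sample per failure mode would likely give better coverage at the same cost.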

§ 02 · GOLDEN DATASETS FROM PRODUCTION
200 real failures beat 5,000 synthetic ones

The single most common mistake teams make when building an eval suite is bootstrapping it with synthetic data. The synthetic cases follow whatever distribution the prompt designer had in mind; production traffic looks different. Synthetic suites produce green dashboards while production breaks on cases the suite never anticipated.

The Arize + Braintrust 2026 guidance converges on the same process:

  1. Capture every bad output. Production monitor flags low-confidence cases, user-reported errors, and any response that fails a sanity check. They flow into a queue.
  2. Curate weekly. A 30-minute review session. Label each case (what went wrong), deduplicate similar failures, decide which represent new patterns.
  3. Cluster, then pick one canonical example per failure mode. Adding 50 nearly-identical variants inflates the set without adding coverage; one well-chosen example per pattern keeps the suite tight.
  4. Version the set like code. The golden set lives in the repo, gets a commit hash, gets reviewed on PR. Changes to the set are themselves changes you can roll back if they turn out to be wrong.
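Step 3 above can be sketched in a few lines, assuming the weekly review has already attached a failure-mode label to each case. `FailureCase` and `canonical_per_mode` are hypothetical names, and "keep the first example seen" is only a stand-in for the reviewer's judgment about which example is most representative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FailureCase:
    case_id: str
    failure_mode: str  # label assigned during the weekly review
    prompt_input: str


def canonical_per_mode(cases: Iterable[FailureCase]) -> List[FailureCase]:
    """Keep one example per failure mode, preserving queue order.

    The first labelled case for each mode wins; later near-duplicates
    are dropped so the golden set grows by pattern, not by volume.
    """
    chosen: Dict[str, FailureCase] = {}
    for case in cases:
        chosen.setdefault(case.failure_mode, case)
    return list(chosen.values())
```

The resulting list is what gets committed to the repo and reviewed on PR, per step 4.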

The math works out heavily in favor of small + curated. A 200-row golden set sourced from real production failures covers maybe 80–90% of the failure modes your users will encounter next month. A 5,000-row synthetic set covers maybe 30% of the same modes, padded with thousands of cases that don’t resemble anything users actually do.

§ 03 · LLM-AS-JUDGE BIASES
Your judge is a model. It has biases. Audit it.

For any task where the “correct” answer isn’t a single token to match (open-ended generation, summarization, dialog quality), teams use an LLM-as-judge to score the outputs. The judge is a second model with a rubric and a comparison task: given output A and output B, which one better satisfies the rubric? It works. It also has measurable, persistent biases, and the Autorubric paper (Rao + Callison-Burch, February 2026) documented two that appear in almost every default setup: verbosity bias, where the judge favors the longer output, and position bias, where the verdict depends on the order in which the outputs are presented.

Mechanical fixes:
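A commonly used mechanical fix for position bias is to ask the judge twice with the outputs swapped and accept a verdict only when both orderings agree. The sketch below shows that idea plus a cheap length-ratio flag for verbosity. It is an illustration under assumptions, not the Autorubric paper's prescribed method: `Judge`, `debiased_verdict`, and `length_ratio` are hypothetical names, and the judge is assumed to answer with the string "first" or "second".

```python
from typing import Callable, Optional

# A judge takes (first_output, second_output) and returns "first" or "second".
Judge = Callable[[str, str], str]


def debiased_verdict(judge: Judge, a: str, b: str) -> Optional[str]:
    """Query the judge in both orders.

    Returns "A" or "B" when the two orderings agree, and None when the
    verdict flips with position (treat as a tie or escalate).
    """
    a_wins_forward = judge(a, b) == "first"    # A shown first
    a_wins_backward = judge(b, a) == "second"  # A shown second
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return None


def length_ratio(winner: str, loser: str) -> float:
    """Word-count ratio of winner to loser.

    A consistently high ratio across many verdicts suggests the judge
    may be rewarding length rather than quality.
    """
    return len(winner.split()) / max(1, len(loser.split()))
```

Swapping doubles the judge cost per comparison, so it may make sense to apply it at the PR check and deploy gate rather than on every local save.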

§ 04 · CROSS-MODEL DISAGREEMENT AS EVAL
Where two frontier models disagree, you have your next test case

A pattern that emerged from production teams in early 2026: run each query through two or three frontier models simultaneously. Where they agree, ship the result. Where they disagree, escalate for human review — and add the disagreement to your golden set. Your eval suite builds itself from the cases the field identifies as genuinely uncertain.

Mechanically:

  1. Send the same input to Sonnet, GPT, and Gemini in parallel.
  2. Score agreement via semantic similarity of the outputs (embedding cosine works fine for most tasks).
  3. If agreement is above a confidence threshold (0.85+), ship the cheapest of the three responses.
  4. If agreement is below the threshold, escalate to a human reviewer and capture the case in the golden set.
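The four steps above can be sketched as a single routing function. Several details are assumptions, since the text does not specify them: embeddings are supplied by the caller, agreement across three responses is taken as the minimum pairwise cosine similarity, and `ModelResponse`, `route`, and `cost_usd` are hypothetical names.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ModelResponse:
    model: str
    text: str
    embedding: Sequence[float]  # produced by whichever embedding model you use
    cost_usd: float


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def route(responses: Sequence[ModelResponse],
          threshold: float = 0.85) -> Tuple[str, Optional[ModelResponse]]:
    """Return ("ship", cheapest response) or ("escalate", None).

    Agreement is the lowest pairwise similarity, so one dissenting
    model is enough to escalate.
    """
    if len(responses) < 2:
        raise ValueError("need at least two responses to measure agreement")
    agreement = min(cosine(a.embedding, b.embedding)
                    for a, b in combinations(responses, 2))
    if agreement >= threshold:
        return "ship", min(responses, key=lambda r: r.cost_usd)
    return "escalate", None
```

On "escalate", the caller would queue the input for human review and add it to the golden-set candidates. Using the minimum rather than the mean similarity is the stricter choice; the mean would escalate less often.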

Two things this gets you. A built-in quality signal per request — disagreement at runtime tells you the request is hard before you even have a result. And an organic golden-set growth process — your test suite grows toward exactly the cases that frontier models find hardest. The cost is the marginal expense of running three models in parallel, which is bounded and predictable.

§ 05 · VIBE CODING (THE ANTI-PATTERN)
$40K spent, three months, shipped: zero

The opposite of eval-driven development is the pattern production teams started calling “vibe coding”: shipping AI features based on demo success instead of measured behavior. The team plays with the prompt for a week, the demo looks great, they ship. Production breaks in 48 hours. Three months later they’ve spent $40K on tokens, debugged the same five failure modes in fifteen different ways, and the feature is still not stable enough for general release.

The April 2026 postmortems from multiple early-stage AI startup teams all hit the same beats. The team felt confident. They skipped writing an eval because “the prompt was the product.” They couldn’t articulate what “done” meant in measurable terms. Every regression debate was a vibes debate. Eventually they wrote the eval and discovered every prompt change for the prior three months had regressed something.

The fix is grammatical: eval first. Code second. Vibes never. Write the eval before the prompt. Write the definition of done before the implementation. Ship the eval as part of the deliverable so the next change can be measured against it.

§ 06 · THE CI PIPELINE IN MOTION
Pick a scenario, watch which stage catches it

Lab · CI pipeline for prompts

Scenario: The engineer rephrased an instruction. It looks cleaner. It silently regresses 12% of the golden set on classification tasks. The PR gate catches it before merge.

  1. Local dev: 100-row smoke set, runs on every prompt save (5 sec). Result: PASS
  2. PR check: 500-row golden set, runs on every PR (45 sec). Result: BLOCK
  3. Deploy gate: full audit set + cross-model agreement (4 min).
  4. Production monitor: sampled live traffic, drift detection rolling 24h.

Notice the asymmetry. The regression scenario costs <1 minute of CI time to catch in the PR check; if it slipped to production, the same bug would erode trust across thousands of user interactions and require an out-of-cycle hotfix. The dev stage is fast feedback, not a quality gate — it’s the PR check and deploy gate that actually block bad changes.

The lab makes the asymmetry of catching early versus catching late visible. A regression caught at PR check costs about 45 seconds of CI time. The same regression caught in production costs whatever your incident-response process costs, multiplied by however many user interactions ran through the broken prompt before someone noticed. The four-stage pipeline isn’t about making the eval thorough; it’s about catching problems at the cheapest possible point in their lifecycle.
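The production monitor is the one stage that has no fixed row set, so it is worth sketching separately. The class below is a minimal illustration of rolling 24-hour drift detection, assuming each sampled live response has already been scored pass/fail and that events arrive in timestamp order. `RollingPassRate`, `drifted`, and the 0.05 tolerance are hypothetical choices, not values from the source.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Optional, Tuple


class RollingPassRate:
    """Pass rate over a sliding time window of sampled production traffic."""

    def __init__(self, window: timedelta = timedelta(hours=24)) -> None:
        self.window = window
        self._events: Deque[Tuple[datetime, bool]] = deque()

    def record(self, ts: datetime, passed: bool) -> None:
        """Add one scored sample and evict anything older than the window."""
        self._events.append((ts, passed))
        cutoff = ts - self.window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def rate(self) -> Optional[float]:
        """Current pass rate, or None if the window is empty."""
        if not self._events:
            return None
        return sum(passed for _, passed in self._events) / len(self._events)

    def drifted(self, baseline: float, tolerance: float = 0.05) -> bool:
        """True when the rolling rate sits more than tolerance below baseline."""
        current = self.rate()
        return current is not None and (baseline - current) > tolerance
```

The baseline would typically be the pass rate recorded at the last deploy gate, so the monitor is measuring drift from a known-good state rather than from an arbitrary target.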

CHECK
You're evaluating two summarization outputs with an LLM judge. Output A is 80 words, Output B is 240 words and largely repeats Output A's content with more detail. Your judge consistently picks B. What's most likely happening, and what's the cheapest fix?

§ · FURTHER READING
References & deeper sources

  1. Milind Nair (2026). Eval-Driven Development — A Four-Stage Pipeline for Shipping AI · Adaline + Braintrust Engineering
  2. Rao + Callison-Burch (2026). Autorubric — Position and Verbosity Biases in LLM-as-Judge · arXiv
  3. Arize (2026). Building Golden Datasets From Production Failures · Arize Blog
  4. Braintrust (2026). Production Evals: Capturing and Curating Real-World Failures · Braintrust Docs
  5. Cross-Model Disagreement WG (2026). Cross-Model Agreement as a Quality Signal · Spec, April 2026
  6. Various postmortems (2026). Vibe Coding: Why Demo-Driven AI Development Fails in Production · Engineering postmortems compilation

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.