Drip · Engineering Practice · 14 min read

Eval-Driven Development

Frontier models saturated the old benchmarks. Production teams replaced them with a four-stage CI pipeline that treats prompts as code and catches regressions before they ship. The new gold standard for working with LLMs.

Brain Drip Editors · Four-stage CI pipeline · May 2026

Eval. A run of your AI system against a known input set with a defined scoring function. The way you measure whether a prompt change improved or regressed quality.

The bottom line. Milind Nair’s framework (March 2026), echoed by Braintrust’s eval-driven-development guidance, replaced static benchmarks with a four-stage continuous eval pipeline: local dev for fast feedback, PR check to block regressions, deploy gate as the pre-prod quality bar, and production monitor for drift detection. The eval set is sourced from real production failures, not synthetic data. LLM judges have measurable biases (verbosity, position) that need mechanical fixes. Cross-model disagreement becomes a bonus signal — where two frontier models disagree on the same input, you have your next golden-set case for free.

§ 00 · SHIP PROMPTS LIKE YOU SHIP CODE
Prompts are code. Test them like code.

For most of 2023 and 2024, evaluating an LLM application meant running a benchmark — MMLU, BBH, MT-Bench, HumanEval — and eyeballing the number. By early 2026 that pattern was broken in two ways. Frontier models had saturated the older benchmarks; new models scored near-identically on the classic suites like MMLU and GSM8K, even as harder benchmarks emerged. And the benchmarks measured the model, not your application — a 92% on MMLU told you nothing about whether your support agent was getting better or worse.

Production teams replaced the benchmarks with a CI pipeline. The unit of evaluation is your prompt + your tools + your wiring, not the model in isolation. The eval runs in CI on every change. A regression on the golden set blocks the merge, the same way a broken test does. The discipline is borrowed wholesale from software engineering and applied to the layer above the model.

§ 01 · THE FOUR-STAGE PIPELINE
Fast feedback, then quality gates, then production drift

Nair’s framing breaks the pipeline into four stages, each with a different speed/coverage tradeoff:

Local dev. A 50–100 row smoke set that runs in about 5 seconds on every prompt save. Catches blatant breakage. The goal is fast feedback so the engineer iterates without context-switching. Not a quality gate: a red run here means “don’t bother committing yet,” and a green run only means the change is worth sending to the PR check.
PR check. The 500-row golden set, run as a GitHub Action on every pull request. Takes 30–60 seconds. Blocks merge if the eval regresses below the configured threshold. This is the actual quality gate — the moment where “cleaner-looking prompt” has to prove it didn’t silently break anything.
Deploy gate. The full audit suite — golden set + cross-model agreement + judge audits — runs immediately before a production deploy. Slower (2–10 minutes) but comprehensive. Catches the rare regressions that snuck past the PR check.
Production monitor. Sampled live traffic continuously scored against the same rubric. Flags drift over rolling windows (24h is a common default). The data this stage generates feeds back into the golden set (see §02).
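The drift check in the production-monitor stage can be sketched as a rolling-window mean compared against a baseline. This is a minimal illustration, not Nair's or Braintrust's implementation: the `DriftMonitor` name, the 0.05 tolerance, and the assumption that each sampled response has already been scored on a 0–1 scale are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DriftMonitor:
    """Rolling-window mean of rubric scores, compared against a fixed baseline."""
    baseline: float                      # e.g. mean score at the last deploy gate
    tolerance: float = 0.05              # hypothetical: allowed drop before flagging
    window_seconds: float = 24 * 3600    # the article's 24h default
    samples: deque = field(default_factory=deque, init=False)  # (timestamp, score)

    def record(self, timestamp: float, score: float) -> None:
        """Add one scored sample and evict anything older than the window."""
        self.samples.append((timestamp, score))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def rolling_mean(self) -> float:
        return sum(score for _, score in self.samples) / len(self.samples)

    def drifted(self) -> bool:
        """True when the window mean has fallen below baseline - tolerance."""
        return bool(self.samples) and self.rolling_mean() < self.baseline - self.tolerance
```

Timestamps are plain seconds here so the logic stays testable; a real monitor would also want a minimum sample count before it is allowed to flag anything.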

The four stages aren’t four different evals — they’re the same eval logic running at four different cost/coverage points. Build the eval once. Run it everywhere. Calibrate the row count per stage.
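One way to read "build the eval once, run it everywhere" is a single scoring function parameterized by a per-stage config. The sketch below assumes a golden set stored as rows with `input` and `expected` keys and a boolean scorer; the `Stage` and `run_eval` names and the 0.95 thresholds are placeholders, and only the row counts come from the stage descriptions above. It covers the golden-set portion only: the deploy gate's cross-model and judge audits, and the production monitor's sampled traffic, are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    max_rows: Optional[int]         # None means "run the full set"
    min_pass_rate: Optional[float]  # None means "report only, never block"


# Row counts follow the stage descriptions; the 0.95 thresholds are placeholders.
STAGES = {
    "local": Stage("local dev", max_rows=100, min_pass_rate=None),
    "pr": Stage("PR check", max_rows=500, min_pass_rate=0.95),
    "deploy": Stage("deploy gate", max_rows=None, min_pass_rate=0.95),
}


def run_eval(rows: Sequence[dict], system: Callable[[str], str],
             scorer: Callable[[str, str], bool], stage: Stage) -> Tuple[float, bool]:
    """Score `system` on the stage's slice of the golden set.

    Returns (pass_rate, blocked). `blocked` is True only when the stage
    has a threshold and the pass rate falls below it.
    """
    subset = rows if stage.max_rows is None else rows[: stage.max_rows]
    if not subset:
        raise ValueError("golden set is empty")
    passed = sum(1 for row in subset if scorer(system(row["input"]), row["expected"]))
    pass_rate = passed / len(subset)
    blocked = stage.min_pass_rate is not None and pass_rate < stage.min_pass_rate
    return pass_rate, blocked
```

In a CI job, a `blocked` result would typically be turned into a non-zero exit code so the merge or deploy fails the same way a broken unit test does.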

§ 02 · GOLDEN DATASETS FROM PRODUCTION
A few curated real failures beat thousands of synthetic ones

The single most common mistake teams make when building an eval suite is bootstrapping it with synthetic data. The synthetic cases follow whatever distribution the prompt designer had in mind; production traffic looks different. Synthetic suites produce green dashboards while production breaks on cases the suite never anticipated.

The Arize + Braintrust 2026 guidance converges on the same process:

Capture every bad output. Production monitor flags low-confidence cases, user-reported errors, and any response that fails a sanity check. They flow into a queue.
Curate weekly. A 30-minute review session. Label each case (what went wrong), deduplicate similar failures, decide which represent new patterns.
Cluster, then pick one canonical example per failure mode. Adding 50 nearly-identical variants inflates the set without adding coverage; one well-chosen example per pattern keeps the suite tight.
Version the set like code. The golden set lives in the repo, gets a commit hash, gets reviewed on PR. Changes to the set are themselves changes you can roll back if they turn out to be wrong.
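The cluster-and-pick step can be sketched in a few lines, assuming the weekly review has already attached a `failure_mode` label to each queued case. The field names, the sample cases, and the "shortest input wins" heuristic are illustrative assumptions, not part of the Arize or Braintrust guidance.

```python
import json
from collections import defaultdict


def curate(queue):
    """Keep one canonical case per failure mode from the labeled review queue."""
    by_mode = defaultdict(list)
    for case in queue:
        by_mode[case["failure_mode"]].append(case)
    # Assumed heuristic: the shortest input is the easiest one to review.
    golden = [min(group, key=lambda c: len(c["input"])) for group in by_mode.values()]
    return sorted(golden, key=lambda c: c["failure_mode"])


def serialize(golden):
    """Stable JSON so a change to the set shows up as a small, reviewable diff."""
    return json.dumps(golden, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Hypothetical review queue: two variants of one failure, plus a new pattern.
queue = [
    {"input": "Refund for order placed twice?", "failure_mode": "duplicate-order"},
    {"input": "I was charged twice, can I get a refund for the second order?",
     "failure_mode": "duplicate-order"},
    {"input": "Cancel my plan but keep my data", "failure_mode": "partial-cancel"},
]
golden = curate(queue)  # two rows: one per failure mode
```

Sorting the rows and the keys keeps the committed file deterministic, which is what makes the set reviewable on a PR and easy to roll back.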

The math works out heavily in favor of small + curated. A small golden set sourced from real production failures covers the large majority of the failure modes your users will encounter next month. A much larger synthetic set covers only a fraction of the same modes, padded with thousands of cases that don’t resemble anything users actually do. (The exact numbers are application-specific; what holds across teams is the direction — distinct real failures buy coverage that bulk synthetic rows do not.)

Sweep it yourself. The lab below sources a golden set from a mix of real and synthetic rows and shows how much of a fixed universe of forty distinct failure modes lights up — and what each run costs in CI time.

Lab · Golden-set coverage
Sweep size and real-vs-synthetic mix, and watch how many of the 40 failure modes light up.

Set size: 200 rows
Share sourced from real production failures: 80% real
Deduplicate near-identical cases (toggle)

Failure-mode coverage: 95% (38 / 40 modes)
CI time per run: 10s (~110 effective rows)
CI seconds / coverage pt: 0.1s (lower is leaner)

Coverage comes from distinct real failure modes, not row count. A ~200-row mostly-real set with dedup on lights up nearly the whole grid in well under a minute; push the slider to 5,000 synthetic rows and coverage stalls around a third while CI time balloons into minutes. Illustrative model — the 40-mode universe and curve constants are tuned for the demo, not measured.


§ 03 · LLM-AS-JUDGE BIASES
Your judge is a model. It has biases. Audit it.

For any task where the “correct” answer isn’t a single token to match — open-ended generation, summarization, dialog quality — teams use an LLM-as-judge to score the outputs. The judge is a second model with a rubric and a comparison task: given output A and output B, which one better satisfies the rubric? It works. It also has measurable, persistent biases — and the canonical study of LLM-as-judge bias, Zheng et al.’s MT-Bench paper (NeurIPS 2023), documented biases that appear in almost every default setup, two of which dominate in practice:

Position bias. The judge tends to favor whichever answer it sees first. The bias is small but real and statistically significant.
Verbosity bias. The judge tends to favor longer answers, even when length doesn’t correlate with quality.

Mechanical fixes:

Shuffle the comparison. Run each comparison twice with positions A→B and B→A; average the scores. The directional position bias cancels out.
Tell the judge to ignore length. Add an explicit instruction in the rubric. Doesn’t fully eliminate verbosity bias but cuts it materially.
Multi-judge ensemble. Run three different judges with majority vote. Tends to wash out any single judge’s idiosyncrasies.
Calibrate to humans. Sample 100 cases, score them with both your judge and a human. Measure agreement. If agreement is low — a common practitioner floor is around 0.7–0.8, versus the ~80%+ that strong judges like GPT-4 reach against humans — your judge needs work before you trust its CI signal.
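Three of the fixes above are simple enough to sketch directly. The code assumes a judge is any callable that takes the two answers in display order and returns its preference for the first-shown answer as a number between 0 and 1; that interface, the function names, and the toy judge are all hypothetical stand-ins for a real model call.

```python
from collections import Counter


def debiased_preference(judge, answer_a, answer_b):
    """Swap-and-average: preference for A, averaged over both display orders."""
    a_shown_first = judge(answer_a, answer_b)        # preference for A when A is first
    a_shown_second = 1.0 - judge(answer_b, answer_a)  # preference for A when B is first
    return (a_shown_first + a_shown_second) / 2


def majority_vote(votes):
    """Ensemble verdict from an odd number of judges, e.g. ["A", "B", "A"]."""
    return Counter(votes).most_common(1)[0][0]


def agreement_rate(judge_labels, human_labels):
    """Fraction of sampled cases where the judge and the human picked the same answer."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def toy_judge(first, second):
    """Stand-in judge with pure position bias: always leans 60/40 to the first answer."""
    return 0.6
```

With `toy_judge`, `debiased_preference` returns 0.5 for any pair, which is the cancellation the shuffle fix relies on. Note that swapping does nothing for verbosity bias, since the longer answer is longer in both orders.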

To feel why a raw judge score is untrustworthy, run the auditor below. Every one of the twelve comparison pairs is a genuine tie, so a fair judge should pick each answer half the time. Toggle the two mechanical fixes and watch the measured win-rate slide back toward a coin flip.

Lab · Judge bias auditor
Twelve truly-equal answer pairs. Toggle the fixes, watch the judge return to a coin flip.

Verbosity gap (extra words on Answer B): +120 words

pair 1 · first: B · longer: B · picks B 86%
pair 2 · first: A · longer: B · picks B 57%
pair 3 · first: B · longer: B · picks B 85%
pair 4 · first: A · longer: B · picks B 63%
pair 5 · first: B · longer: B · picks B 82%
pair 6 · first: A · longer: B · picks B 60%
pair 7 · first: B · longer: B · picks B 83%
pair 8 · first: A · longer: B · picks B 62%
pair 9 · first: B · longer: B · picks B 87%
pair 10 · first: A · longer: B · picks B 58%
pair 11 · first: B · longer: B · picks B 84%
pair 12 · first: A · longer: B · picks B 59%

Measured win-rate · Answer B: 72% (target band 45–55%)
Judge verdict: BIASED

Every pair is a genuine tie, so a fair judge should land near 50%. Raw, the judge can be 70% confident in a coin flip — red rows are verbosity-driven, amber rows position-driven. Swap-and-average cancels the directional position term; the length-neutral rubric cuts (but doesn’t erase) the verbosity push; the ensemble smooths residual noise. With the two mechanical fixes on, B’s win-rate settles back into the 45–55% band. Illustrative model — coefficients are tuned for the demo, not measured.
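The twelve readouts above can be split into a position term and a verbosity term with a little arithmetic. This is a rough decomposition that assumes the two biases simply add, which the demo's illustrative model may or may not do; the numbers are the lab's, not measurements of a real judge.

```python
from statistics import mean

# (answer shown first, percent of runs where the judge picked B), pairs 1 to 12.
readouts = [("B", 86), ("A", 57), ("B", 85), ("A", 63), ("B", 82), ("A", 60),
            ("B", 83), ("A", 62), ("B", 87), ("A", 58), ("B", 84), ("A", 59)]

overall = mean(pct for _, pct in readouts)                       # about 72.2
b_first = mean(pct for first, pct in readouts if first == "B")   # 84.5
a_first = mean(pct for first, pct in readouts if first == "A")   # about 59.8

# Under the additive assumption, position bias is half the gap between orders,
# and verbosity bias is whatever keeps the order-averaged rate above 50%.
position_effect = (b_first - a_first) / 2    # about 12.3 points
verbosity_effect = overall - 50              # about 22.2 points
```

Averaging the two orders leaves B at roughly 72%, so swap-and-average on its own removes only the position term. The remaining 22 or so points are the part the length-neutral rubric has to deal with.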


§ 04 · CROSS-MODEL DISAGREEMENT AS EVAL
Where two frontier models disagree, you have your next test case

A pattern that emerged from production teams in early 2026: run each query through two or three frontier models simultaneously. Where they agree, ship the result. Where they disagree, escalate for human review — and add the disagreement to your golden set. Your eval suite builds itself from the cases the field identifies as genuinely uncertain.

Mechanically:

Send the same input to Sonnet, GPT, and Gemini in parallel.
Score agreement via semantic similarity of the outputs (embedding cosine works fine for most tasks).
If semantic agreement clears a tuned confidence threshold (0.85 is a reasonable starting point), ship the response from the cheapest of the three models.
If agreement is below the threshold, escalate to a human reviewer and capture the case in the golden set.
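The routing steps above can be sketched as follows, assuming each model's response already comes with an embedding vector and a per-call cost. The `route` function, the dict keys, and the choice to use the lowest pairwise similarity as the agreement score are assumptions for illustration; the text only specifies embedding cosine and a 0.85 starting threshold.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route(responses, threshold=0.85):
    """Ship the cheapest response if all models agree, otherwise escalate.

    `responses` is a list of dicts with "model", "text", "embedding", "cost".
    Agreement is the lowest pairwise similarity (an assumed, conservative choice).
    """
    if len(responses) < 2:
        raise ValueError("need at least two responses to measure agreement")
    agreement = min(cosine(a["embedding"], b["embedding"])
                    for a, b in combinations(responses, 2))
    if agreement >= threshold:
        cheapest = min(responses, key=lambda r: r["cost"])
        return {"action": "ship", "response": cheapest, "agreement": agreement}
    # Disagreement: send to a human and queue the case for the golden set.
    return {"action": "escalate", "response": None, "agreement": agreement}
```

Taking the minimum means a single dissenting model is enough to escalate; a mean over pairs would be more lenient. Which is better depends on how expensive human review is relative to a wrong answer.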

Two things this gets you. First, a built-in quality signal per request: disagreement at runtime tells you the request is hard before you ship a result. Second, an organic golden-set growth process: your test suite grows toward exactly the cases that frontier models find hardest. The cost is the marginal expense of running three models in parallel, which is bounded and predictable.

§ 05 · VIBE CODING (THE ANTI-PATTERN)
Months of iteration, tens of thousands in tokens, shipped: nothing

The opposite of eval-driven development is the prompt-level version of “vibe coding,” the term Andrej Karpathy coined in early 2025 for building software by feel (“forget that the code even exists”). Applied to AI features, it means shipping based on demo success instead of measured behavior. A representative failure: the team plays with the prompt for a week, the demo looks great, they ship. Production breaks in 48 hours. Months later they’ve burned tens of thousands of dollars in tokens, debugged the same five failure modes in fifteen different ways, and the feature is still not stable enough for general release.

These failures share a consistent pattern. The team felt confident. They skipped writing an eval because “the prompt was the product.” They couldn’t articulate what “done” meant in measurable terms. Every regression debate was a vibes debate. Eventually they wrote the eval and discovered every prompt change for the prior months had regressed something.

The fix is grammatical: eval first. Code second. Vibes never. Write the eval before the prompt. Write the definition of done before the implementation. Ship the eval as part of the deliverable so the next change can be measured against it.

§ 06 · THE CI PIPELINE IN MOTION
Pick a scenario, watch which stage catches it

Lab · CI pipeline for prompts
Pick a change scenario, watch which stage catches it.

Scenario shown: the engineer rephrased an instruction. It looks cleaner. It silently regresses 12% of the golden set on classification tasks. The PR gate catches it before merge.

stage 1 · Local dev · 100-row smoke set, runs on every prompt save (5 sec) · PASS
stage 2 · PR check · 500-row golden set, runs on every PR (45 sec) · BLOCK
stage 3 · Deploy gate · Full audit set + cross-model agreement (4 min) · not run
stage 4 · Production monitor · Sampled live traffic, drift detection rolling 24h · not run

Notice the asymmetry. The regression scenario costs <1 minute of CI time to catch in the PR check; if it slipped to production, the same bug would erode trust across thousands of user interactions and require an out-of-cycle hotfix. The dev stage is fast feedback, not a quality gate — it’s the PR check and deploy gate that actually block bad changes.

The lab makes the asymmetry between catching early and catching late visible. A regression caught at PR check costs about 45 seconds of CI time. The same regression caught in production costs whatever your incident-response process costs, multiplied by however many user interactions ran through the broken prompt before someone noticed. The four-stage pipeline isn’t about making the eval thorough; it’s about catching problems at the cheapest possible point in their lifecycle.

Plotted on a log scale, that asymmetry is a staircase: the cost of catching the same regression jumps by roughly an order of magnitude at every stage it survives.

Fig 1. The cost of catching one regression rises by roughly an order of magnitude at each stage it slips past. The whole point of the four-stage pipeline is to fail the change at the leftmost, cheapest gate.

CHECK · You're evaluating two summarization outputs with an LLM judge. Output A is 80 words, Output B is 240 words and largely repeats Output A's content with more detail. Your judge consistently picks B. What's most likely happening, and what's the cheapest fix?

§ · FURTHER READING
References & deeper sources

Lianmin Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · NeurIPS 2023 (arXiv:2306.05685)
Milind Nair (2026). LLM Evaluation in 2026 · Medium
Braintrust (2026). What Is Eval-Driven Development: How to Ship High-Quality Agents Without Guessing · Braintrust
Braintrust (2026). Scorers (LLM-as-a-Judge) · Braintrust Docs
Arize (2026). Golden Dataset: Role in Custom LLM Evals · Arize
promptfoo (2026). Testing Prompts with GitHub Actions · promptfoo Docs
OpenAI (2024). Evals — A Framework for Evaluating LLMs and LLM Systems · GitHub
LangSmith (2026). How to Define an LLM-as-a-Judge Evaluator · LangChain Docs
Andrej Karpathy (2025). The original “vibe coding” post (“forget that the code even exists”) · X, 2 Feb 2025

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.