Eval-Driven Development
Frontier models saturated the old benchmarks. Production teams replaced them with a four-stage CI pipeline that treats prompts as code and catches regressions before they ship. The new gold standard for working with LLMs.
§ 00 · SHIP PROMPTS LIKE YOU SHIP CODE
Prompts are code. Test them like code.
For most of 2023 and 2024, evaluating an LLM application meant running a benchmark — MMLU, BBH, MT-Bench, HumanEval — and eyeballing the number. By early 2026 that pattern was broken in two ways. Frontier models had saturated the benchmarks; new models scored the same as old models on every standard suite. And the benchmarks measured the model, not your application — a 92% on MMLU told you nothing about whether your support agent was getting better or worse.
Production teams replaced the benchmarks with a CI pipeline. The unit of evaluation is your prompt + your tools + your wiring, not the model in isolation. The eval runs in CI on every change. A regression on the golden set blocks the merge, the same way a broken test does. The discipline is borrowed wholesale from software engineering and applied to the layer above the model.
§ 01 · THE FOUR-STAGE PIPELINE
Fast feedback, then quality gates, then production drift
Nair’s framing breaks the pipeline into four stages, each with a different speed/coverage tradeoff:
- Local dev. A 50–100 row smoke set that runs in 5 seconds on every prompt save. Catches blatant breakage. The goal is fast feedback so the engineer iterates without context-switching. This is not a quality gate; a red result here just means “don’t bother committing yet.”
- PR check. The 500-row golden set, run as a GitHub Action on every pull request. Takes 30–60 seconds. Blocks merge if the eval regresses below the configured threshold. This is the actual quality gate: the moment where a “cleaner-looking prompt” has to prove it didn’t silently break anything.
- Deploy gate. The full audit suite — golden set + cross-model agreement + judge audits — runs immediately before a production deploy. Slower (2–10 minutes) but comprehensive. Catches the rare regressions that snuck past the PR check.
- Production monitor. Sampled live traffic continuously scored against the same rubric. Flags drift over rolling windows (24h is a common default). The data this stage generates feeds back into the golden set (see §02).
The four stages aren’t four different evals — they’re the same eval logic running at four different cost/coverage points. Build the eval once. Run it everywhere. Calibrate the row count per stage.
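A minimal sketch of "build the eval once, calibrate the row count per stage" might look like the following. The `Stage` class, the `run_eval` function, the exact-match scoring, and the idea of taking a prefix of one row list are all illustrative assumptions, not something the sources prescribe. The production monitor is left out because it scores sampled live traffic rather than a stored set.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    max_rows: Optional[int]  # None means "run every row"
    blocking: bool           # True if a failure should stop the change


# Row counts follow the text above; names and flags are illustrative assumptions.
STAGES = [
    Stage("local_dev", max_rows=100, blocking=False),
    Stage("pr_check", max_rows=500, blocking=True),
    Stage("deploy_gate", max_rows=None, blocking=True),
]


def run_eval(rows: Sequence[dict], predict: Callable[[str], str],
             stage: Stage, threshold: float) -> dict:
    """Score one stage with the shared eval logic (exact match, for simplicity)."""
    subset = rows if stage.max_rows is None else rows[: stage.max_rows]
    passed = sum(1 for row in subset if predict(row["input"]) == row["expected"])
    pass_rate = passed / len(subset) if subset else 0.0
    return {
        "stage": stage.name,
        "rows": len(subset),
        "pass_rate": pass_rate,
        # Only blocking stages can stop a change; local dev just reports.
        "blocks": stage.blocking and pass_rate < threshold,
    }
```

The same `run_eval` call serves every stage; only the `Stage` passed in changes. A real setup would swap the exact-match comparison for whatever rubric or judge the application needs.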
§ 02 · GOLDEN DATASETS FROM PRODUCTION
200 real failures beat 5,000 synthetic ones
The single most common mistake teams make when building an eval suite is bootstrapping it with synthetic data. The synthetic cases follow whatever distribution the prompt designer had in mind; production traffic looks different. Synthetic suites produce green dashboards while production breaks on cases the suite never anticipated.
The Arize + Braintrust 2026 guidance converges on the same process:
- Capture every bad output. Production monitor flags low-confidence cases, user-reported errors, and any response that fails a sanity check. They flow into a queue.
- Curate weekly. A 30-minute review session. Label each case (what went wrong), deduplicate similar failures, decide which represent new patterns.
- Cluster, then pick one canonical example per failure mode. Adding 50 nearly-identical variants inflates the set without adding coverage; one well-chosen example per pattern keeps the suite tight.
- Version the set like code. The golden set lives in the repo, gets a commit hash, gets reviewed on PR. Changes to the set are themselves changes you can roll back if they turn out to be wrong.
The math works out heavily in favor of small + curated. A 200-row golden set sourced from real production failures covers maybe 80–90% of the failure modes your users will encounter next month. A 5,000-row synthetic set covers maybe 30% of the same modes, padded with thousands of cases that don’t resemble anything users actually do.
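The curation steps above can be sketched roughly as follows, assuming each queued case already carries a `failure_mode` label from the weekly review. The field names, the "shortest input wins" heuristic for choosing a canonical example, and the content fingerprint are hypothetical choices for illustration. The fingerprint is only a convenience for logging which dataset version an eval ran against; it does not replace the commit hash.

```python
import hashlib
import json


def pick_canonical(cases: list) -> list:
    """Keep one example per failure_mode label.

    Heuristic (an assumption): prefer the shortest input, since it is
    usually the easiest to read when the test fails later.
    """
    by_mode = {}
    for case in cases:
        mode = case["failure_mode"]
        best = by_mode.get(mode)
        if best is None or len(case["input"]) < len(best["input"]):
            by_mode[mode] = case
    # Sort by label so the output order is stable across runs.
    return [by_mode[mode] for mode in sorted(by_mode)]


def dataset_fingerprint(rows: list) -> str:
    """Short, deterministic hash of the golden set's content."""
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

In practice the labels might come from embedding-based clustering rather than manual review; the selection step stays the same either way.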
§ 03 · LLM-AS-JUDGE BIASES
Your judge is a model. It has biases. Audit it.
For any task where the “correct” answer isn’t a single token to match — open-ended generation, summarization, dialog quality — teams use an LLM-as-judge to score the outputs. The judge is a second model with a rubric and a comparison task: given output A and output B, which one better satisfies the rubric? It works. It also has measurable, persistent biases — and the Autorubric paper (Rao + Callison-Burch, February 2026) documented two that appear in almost every default setup:
- Position bias. The judge tends to favor whichever answer it sees first. The bias is small but real and statistically significant.
- Verbosity bias. The judge tends to favor longer answers, even when length doesn’t correlate with quality.
Mechanical fixes:
- Shuffle the comparison. Run each comparison twice, once with the answers ordered A→B and once B→A, and average the scores. A consistent position effect then cancels out.
- Tell the judge to ignore length. Add an explicit instruction in the rubric. Doesn’t fully eliminate verbosity bias but cuts it materially.
- Multi-judge ensemble. Run three different judges with majority vote. Tends to wash out any single judge’s idiosyncrasies.
- Calibrate to humans. Sample 100 cases, score them with both your judge and a human. Measure agreement. If agreement is below ~0.7, your judge needs work before you trust its CI signal.
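Three of the fixes above (position swapping, majority vote, and human calibration) can be sketched as below. This is an illustration under stated assumptions: a judge is modeled as any callable that sees two answers and returns `"first"` or `"second"`, and because that verdict is categorical rather than numeric, a judge that flips when the order is swapped is recorded as a `"tie"` instead of having its scores averaged. The agreement measure is the plain match rate; the text does not say which agreement statistic the ~0.7 bar refers to.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumed judge interface: (first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, a: str, b: str) -> str:
    """Ask in both orders; return "A", "B", or "tie" if the verdict flips."""
    forward_pick = "A" if judge(a, b) == "first" else "B"   # A shown first
    backward_pick = "B" if judge(b, a) == "first" else "A"  # B shown first
    return forward_pick if forward_pick == backward_pick else "tie"


def ensemble_preference(judges: Sequence[Judge], a: str, b: str) -> str:
    """Majority vote over position-debiased judges; no strict majority is a tie."""
    votes = Counter(debiased_preference(judge, a, b) for judge in judges)
    winner, count = votes.most_common(1)[0]
    return winner if count > len(judges) / 2 else "tie"


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of cases where the judge and the human gave the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

A judge that always prefers whatever it sees first comes out as `"tie"` on every comparison, which makes position bias visible instead of letting it decide the winner.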
§ 04 · CROSS-MODEL DISAGREEMENT AS EVAL
Where two frontier models disagree, you have your next test case
A pattern that emerged from production teams in early 2026: run each query through two or three frontier models simultaneously. Where they agree, ship the result. Where they disagree, escalate for human review — and add the disagreement to your golden set. Your eval suite builds itself from the cases the field identifies as genuinely uncertain.
Mechanically:
- Send the same input to Sonnet, GPT, and Gemini in parallel.
- Score agreement via semantic similarity of the outputs (embedding cosine works fine for most tasks).
- If agreement is above a confidence threshold (0.85+), ship the response from the cheapest of the three models.
- If agreement is below the threshold, escalate to a human reviewer and capture the case in the golden set.
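The routing steps above could be sketched like this, assuming the embeddings have already been computed by whatever embedding model the team uses. The response dictionary shape, the `route` function, and the choice to take the lowest pairwise similarity as the agreement score are assumptions made for the example; the text does not say how three similarities should be combined.

```python
import math
from itertools import combinations


def cosine(u: list, v: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def route(responses: list, threshold: float = 0.85) -> dict:
    """Ship the cheapest response if all models agree, otherwise escalate.

    Each response is a dict with "model", "text", "embedding", and "cost"
    keys (a hypothetical shape for this sketch).
    """
    if len(responses) < 2:
        raise ValueError("need at least two responses to measure agreement")
    # Conservative choice: the least similar pair sets the agreement score.
    agreement = min(
        cosine(a["embedding"], b["embedding"])
        for a, b in combinations(responses, 2)
    )
    if agreement >= threshold:
        cheapest = min(responses, key=lambda r: r["cost"])
        return {"action": "ship", "agreement": agreement, "response": cheapest}
    # Below threshold: hand off to a human and queue the case for the golden set.
    return {"action": "escalate", "agreement": agreement, "response": None}
```

Using the minimum rather than the mean means a single dissenting model is enough to trigger escalation, which errs toward sending more cases to review.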
Two things this gets you. A built-in quality signal per request — disagreement at runtime tells you the request is hard before you even have a result. And an organic golden-set growth process — your test suite grows toward exactly the cases that frontier models find hardest. The cost is the marginal expense of running three models in parallel, which is bounded and predictable.
§ 05 · VIBE CODING (THE ANTI-PATTERN)
$40K spent, three months, shipped: zero
The opposite of eval-driven development is the pattern production teams started calling “vibe coding”: shipping AI features based on demo success instead of measured behavior. The team plays with the prompt for a week, the demo looks great, they ship. Production breaks in 48 hours. Three months later they’ve spent $40K on tokens, debugged the same five failure modes in fifteen different ways, and the feature is still not stable enough for general release.
The April 2026 postmortems from multiple early-stage AI startup teams all hit the same beats. The team felt confident. They skipped writing an eval because “the prompt was the product.” They couldn’t articulate what “done” meant in measurable terms. Every regression debate was a vibes debate. Eventually they wrote the eval and discovered every prompt change for the prior three months had regressed something.
The fix is grammatical: eval first. Code second. Vibes never. Write the eval before the prompt. Write the definition of done before the implementation. Ship the eval as part of the deliverable so the next change can be measured against it.
§ 06 · THE CI PIPELINE IN MOTION
Pick a scenario, watch which stage catches it
The engineer rephrased an instruction. It looks cleaner. It silently regresses 12% of the golden set on classification tasks. The PR gate catches it before merge.
Notice the asymmetry. The regression scenario costs <1 minute of CI time to catch in the PR check; if it slipped to production, the same bug would erode trust across thousands of user interactions and require an out-of-cycle hotfix. The dev stage is fast feedback, not a quality gate — it’s the PR check and deploy gate that actually block bad changes.
This scenario makes the asymmetry of catching early versus catching late visible. A regression caught at PR check costs about 45 seconds of CI time. The same regression caught in production costs whatever your incident-response process costs, multiplied by however many user interactions ran through the broken prompt before someone noticed. The four-stage pipeline isn’t about making the eval thorough; it’s about catching problems at the cheapest possible point in their lifecycle.
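As a rough illustration of the PR check in that scenario, the sketch below compares per-row results for the current prompt against the proposed one and decides whether to block. The `regression_gate` name, the input shape (row id mapped to pass/fail), and the 2-point tolerance are assumptions for the example; a real gate would use whatever threshold the team has configured.

```python
def regression_gate(baseline: dict, candidate: dict, max_drop: float = 0.02) -> dict:
    """Compare two eval runs over the same golden set.

    baseline / candidate map a row id to True (passed) or False (failed).
    The change is blocked when the pass rate drops by more than max_drop.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no rows")
    base_rate = sum(baseline[row] for row in shared) / len(shared)
    cand_rate = sum(candidate[row] for row in shared) / len(shared)
    # Rows that used to pass and now fail are what a reviewer needs to see.
    newly_failing = [row for row in shared if baseline[row] and not candidate[row]]
    return {
        "base_rate": base_rate,
        "cand_rate": cand_rate,
        "newly_failing": newly_failing,
        "blocked": (base_rate - cand_rate) > max_drop,
    }
```

A CI job would typically print `newly_failing` and exit with a non-zero status when `blocked` is true, so the pull request shows a failed check just as it would for a broken unit test.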
§ · FURTHER READING
References & deeper sources
- (2026). Eval-Driven Development — A Four-Stage Pipeline for Shipping AI · Adaline + Braintrust Engineering
- (2026). Autorubric — Position and Verbosity Biases in LLM-as-Judge · arXiv
- (2026). Building Golden Datasets From Production Failures · Arize Blog
- (2026). Production Evals: Capturing and Curating Real-World Failures · Braintrust Docs
- (2026). Cross-Model Agreement as a Quality Signal · Spec, April 2026
- (2026). Vibe Coding: Why Demo-Driven AI Development Fails in Production · Engineering postmortems compilation