Drip · Engineering Practice · 14 min read

Verifying AI Code

66% of developers in the 49,000-person Stack Overflow 2025 survey gave the same answer to “what frustrates you most about AI?”: code that is almost right. Senior developers trust AI output the least. That’s a feature, not a bug.

The bottom line. Stack Overflow surveyed 49,000 developers in 2025. 66% said their biggest frustration with AI is “almost right” code. 45% said debugging AI output takes longer than writing it from scratch. 46% distrust AI output — and senior developers distrust it most (only 2.6% highly trust; 20% highly distrust). The teams that ship treat AI output as a draft, build a verification cascade around it (type check → tests → sub-agent review → human review), and protect against the specific failure modes — sycophancy drift, hallucinated APIs — that bypass the obvious gates. The lab below lets you see what each cascade configuration would catch.

§ 00 · THE 66% PROBLEM
Code that compiles, runs, and is subtly wrong

The single most-cited frustration with AI-assisted development in 2025 wasn’t bad code. It was almost right code. 66% of the 49,000 developers Stack Overflow surveyed agreed: the failure mode that hurts most is the one where the output looks fine, compiles fine, runs fine, and only reveals its wrongness three commits later when a test starts intermittently failing or a prod metric drifts.

45% of those same developers said debugging AI output takes longer than writing the code would have. The numbers are consistent across roles, languages, and company sizes. The productivity story we keep being sold — “AI writes the code, you ship faster” — is half the story. The other half is the verification tax, and most teams haven’t accounted for it.

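The error classes that make up the 66% break down roughly into four, the same four the lab below tracks: type mismatches, logic errors, hallucinated APIs, and sycophancy drift.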

§ 01 · THE TRUST GAP
Senior developers trust AI least, and that’s the right instinct

Same survey, different cut. Among senior developers, only 2.6% highly trust AI output and 20% highly distrust it. Distrust correlates with seniority: the engineers who have shipped the most code distrust AI most. They’ve seen enough “almost right” cases to know what almost right looks like, and they’ve developed an instinct for verifying before trusting.

That instinct is a feature, not a bug. The juniors who rubber-stamp AI PRs because the code looks plausible are the same juniors who would have rubber-stamped a senior’s PR in a pre-AI codebase. The skill that’s scaling is adversarial reading — the discipline of approaching every line as “what would have to be true for this to be correct, and can I verify each of those things.”

§ 02 · TREAT OUTPUT AS A DRAFT
Read every line. Type-check the rest.

The first habit is grammatical: every piece of AI code is a draft until verified. Never the contract. The phrase the user types into the chat (“build me a function that does X”) is the specification; the AI’s output is one candidate implementation. There may be a better one. There may be a wrong one. The user is responsible for verifying which.

This sounds obvious until you watch teams who don’t do it. Common anti-patterns:

The cheapest possible eval is the type-checker. Run it. If the AI hallucinated a method, you’ll find out in 8 seconds instead of in production. The 92% catch rate on type mismatches in the lab below is real — the type-check is doing structural work the model can’t.
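A minimal TypeScript sketch of what that catch looks like. The `tags.unique` member is a made-up example of the kind of method a model might invent; it does not come from the survey or the lab. The `@ts-expect-error` directive is there only so the snippet still compiles: it asserts that the following line is a type error.

```typescript
const tags: string[] = ["b", "a", "b"];

// Plausible-looking but invented: arrays have no `unique` member.
// Without the directive below, `tsc` stops here with
// "Property 'unique' does not exist on type 'string[]'".
// @ts-expect-error deliberate hallucinated-API example
const hallucinated: unknown = tags.unique;

// What the standard library actually offers: filter + indexOf.
const deduped: string[] = tags.filter((t, i) => tags.indexOf(t) === i);

console.log(hallucinated === undefined); // the member really is absent at runtime
console.log(deduped.join(","));
```

In plain JavaScript the same mistake would surface only when the call executes; the type-check moves it to before the code runs.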

§ 03 · THE CONTRACT THE AI CANNOT SEE
Define done. Outside the prompt.

The single highest-leverage habit for working with AI on code: write the spec the AI cannot see. The eval is the contract. The AI proposes; the eval decides whether the proposal counts. This is the pattern at the heart of the Eval-Driven Development drip; in this context it’s a verification primitive.

Concrete shape:

  1. Write the test cases before the implementation. Cover the obvious path, the edge case you’re worried about, and one adversarial input.
  2. Ask the AI for the implementation. Don’t show it the test code if you can avoid it.
  3. Run the tests. If they pass, the AI met the contract you wrote. If they don’t, the AI met its own interpretation and you have evidence to point at.
  4. When the AI “fixes” the test it failed, re-read the change carefully — that’s the moment where it might adjust the test to its output rather than the other way around.
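
The four steps above can be sketched as a small harness. Everything named here is hypothetical: `slugify` is an illustrative task, and `candidateA` and `candidateB` stand in for two implementations a model might return. The point is only that the cases exist before the implementation and stay out of the prompt.

```typescript
// Hypothetical contract for a `slugify` helper, written before any
// implementation exists and never pasted into the prompt.
type Case = { name: string; input: string; expected: string };

const contract: Case[] = [
  { name: "obvious path", input: "Hello World", expected: "hello-world" },
  { name: "edge case: empty string", input: "", expected: "" },
  { name: "adversarial: punctuation and padding", input: "  --Hello,   World!--  ", expected: "hello-world" },
];

// Runs one candidate against the contract and returns the names of the
// cases it fails. An empty array means the candidate met the contract.
function failedCases(impl: (s: string) => string, cases: Case[]): string[] {
  return cases.filter((c) => impl(c.input) !== c.expected).map((c) => c.name);
}

// An "almost right" candidate: fine on the obvious path, wrong as soon
// as punctuation or padding shows up.
const candidateA = (s: string): string => s.toLowerCase().replace(/ /g, "-");

// A candidate that handles all three cases.
const candidateB = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");

console.log(failedCases(candidateA, contract)); // names the adversarial case
console.log(failedCases(candidateB, contract)); // empty: contract met
```

If a later "fix" changes the entries in `contract` rather than the candidate, that is the step 4 warning sign.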

§ 04 · SYCOPHANCY AS A FAILURE MODE
Your agent agrees too much

Of all the AI code failure modes, the one that’s hardest to defend against with type-checks or tests is sycophancy drift. LLMs are RLHF-trained to please. When you express doubt about your own premise (“wait, is my approach right?” or “maybe I should use X instead”) the model tends to agree with the doubt, regardless of whether the original premise was correct. The output gets worse precisely when you push it.

A senior developer would push back. The model caves. And the caving looks like agreement — it’s polite, articulate, sometimes even reasoned — which is exactly what makes it dangerous. Sycophancy doesn’t produce gibberish; it produces plausible code that’s aligned with the wrong constraint.

Patches:

§ 05 · TWO-PASS REVIEW
Critique, then defend

A pattern that catches both sycophancy drift and logic errors that a single-pass review misses: run two passes through the AI itself. First pass, “critique mode” — the AI reads the diff and lists everything that might be wrong, no defending. Second pass, “defend mode” — the AI reads the critiques and addresses each one.

The critique pass is the value. Most AI code reviewers, asked for a balanced review, default to confirming what they see. Asked to find only the problems, they get meaningfully more critical. The defend pass surfaces which critiques have answers and which don’t. The ones that don’t are your real bugs.

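// `AiClient` stands in for whatever LLM client you use; this shape is an assumption.
interface AiClient {
  complete(req: { system: string; user: string }): Promise<string>;
}

// Keep only the lines the defend pass marked UNRESOLVED.
const extractUnresolved = (text: string): string[] =>
  text.split("\n").filter((line) => line.includes("UNRESOLVED"));

async function twoPassReview(ai: AiClient, diff: string): Promise<string[]> {
  // Pass 1: critique only, no defending.
  const critiques = await ai.complete({
    system: "You are an adversarial reviewer. List every problem " +
            "with this diff. Do not defend it. Do not be balanced.",
    user: diff,
  });

  // Pass 2: answer each critique and flag the ones with no answer.
  const defense = await ai.complete({
    system: "You wrote this diff. Address each critique. " +
            "Where a critique has no answer, mark it UNRESOLVED.",
    user: `${diff}\n\nCritiques:\n${critiques}`,
  });

  // UNRESOLVED lines are the real bugs.
  return extractUnresolved(defense);
}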

§ 06 · THE VERIFICATION CASCADE
Cheap stages first, expensive stages last

Tying it all together: production teams that ship reliably with AI assistance run their output through a cascade. Cheap stages run first, expensive stages run only on what survives. Most of the 66% “almost right” failures get caught at the cheap end; the expensive stages exist for the residual that gets through.

The lab below lets you toggle each stage and see what gets caught. Notice that no single stage covers all four error classes — sycophancy in particular slips past everything except sub-agent review and human review. The cascade is composed, not parallel.
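
A rough sketch of the cascade's control flow, under stated assumptions: the stage bodies are stubs, and every cost except the 8-second type check (taken from the lab readout) is a placeholder. Real stages would call `tsc`, a test runner, a reviewing model, and a human queue.

```typescript
type Verdict = { ok: boolean; reason?: string };
type Stage = { name: string; costSeconds: number; run: (diff: string) => Verdict };
type CascadeResult = { passed: boolean; stoppedAt?: string; ran: string[] };

// Runs stages cheapest-first and stops at the first failure, so the
// expensive stages only ever see diffs that survived the cheap ones.
function runCascade(stages: Stage[], diff: string): CascadeResult {
  const ordered = stages.slice().sort((a, b) => a.costSeconds - b.costSeconds);
  const ran: string[] = [];
  for (const stage of ordered) {
    ran.push(stage.name);
    const verdict = stage.run(diff);
    if (!verdict.ok) return { passed: false, stoppedAt: stage.name, ran };
  }
  return { passed: true, ran };
}

// Stub stages, deliberately listed out of order. The string checks are
// placeholders for real tooling, not real detection logic.
const stages: Stage[] = [
  { name: "human review", costSeconds: 1800, run: () => ({ ok: true }) },
  { name: "unit tests", costSeconds: 60, run: (d) => ({ ok: d.indexOf("off-by-one") === -1 }) },
  { name: "type check", costSeconds: 8, run: (d) => ({ ok: d.indexOf(".unique(") === -1 }) },
  { name: "sub-agent review", costSeconds: 120, run: () => ({ ok: true }) },
];

const result = runCascade(stages, "return tags.unique();");
console.log(result.stoppedAt, result.ran); // stops at the type check; nothing else runs
```

Sorting by cost is one way to encode "cheap stages first"; a fixed, hand-written order would work just as well.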

Lab · verification cascade
Toggle stages to see what percent of “almost-right” AI code each combination catches, broken down by error class.

Type check: the cheapest possible eval. Runs in seconds, catches most fabricated APIs and wrong signatures.

Catch rate per error class (type check only)

  Error class        Catch rate
  Type mismatch      92%
  Logic error        4%
  Hallucinated API   78%
  Sycophancy drift   0%

  Weighted catch: 44%
  Bad PRs / 100: 56
  Time per PR: 8s

The TypeScript check alone catches almost all type mismatches but virtually no logic errors and zero sycophancy drift. Layer the unit tests + sub-agent review and the weighted catch jumps well into the 70s. Human review is the only stage with high catch rates across every error class, and it’s also the slowest — which is why the cascade exists.
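
The type-check-only readout can be sanity-checked with a few lines of arithmetic. The lab does not state how it weights the error classes, so the equal weights below are an assumption; they happen to reproduce the figures shown.

```typescript
// Per-class catch rates for the type check alone, copied from the lab readout:
// type mismatch, logic error, hallucinated API, sycophancy drift.
const rates = [92, 4, 78, 0];

// ASSUMPTION: each error class counts equally. The lab's real weights are not given.
const equalWeights = [1, 1, 1, 1];

// Weighted mean of the per-class catch rates, in percent.
function weightedCatch(catchRates: number[], weights: number[]): number {
  const totalWeight = weights.reduce((sum, w) => sum + w, 0);
  const weightedSum = catchRates.reduce((sum, r, i) => sum + r * (weights[i] ?? 0), 0);
  return weightedSum / totalWeight;
}

const caught = weightedCatch(rates, equalWeights); // (92 + 4 + 78 + 0) / 4 = 43.5
const badPer100 = 100 - Math.round(caught);        // 100 - 44 = 56

console.log(caught, Math.round(caught), badPer100); // 43.5 44 56
```

Under that assumption, 56 of every 100 flawed PRs pass a type-check-only gate, which is the gap the later stages exist to close.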

Verify everything. Ship small. The verification cost compounds with diff size; a 50-line PR is verifiable, a 500-line PR is rubber-stamped. The most senior engineers in the survey reported, in addition to high distrust, a strong preference for small AI-assisted commits — which is the operational form of the same instinct.
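
One way to make "ship small" mechanical is a size gate on the diff. This is a sketch, not a recommendation from the survey: the limit of 50 is borrowed from the "50-line PR" above purely for illustration, and `changedLines` and `reviewableSize` are hypothetical helper names.

```typescript
// Counts added and removed lines in a unified diff, skipping the
// "+++" / "---" file headers. Simplification: a changed line whose own
// content starts with "++" or "--" would be mistaken for a header.
function changedLines(unifiedDiff: string): number {
  return unifiedDiff.split("\n").filter((line) => {
    const isHeader = line.indexOf("+++") === 0 || line.indexOf("---") === 0;
    const first = line.charAt(0);
    return !isHeader && (first === "+" || first === "-");
  }).length;
}

// ASSUMPTION: an illustrative limit, not a measured threshold.
const MAX_CHANGED_LINES = 50;

function reviewableSize(unifiedDiff: string, limit: number = MAX_CHANGED_LINES): boolean {
  return changedLines(unifiedDiff) <= limit;
}

const sample = [
  "--- a/slug.ts",
  "+++ b/slug.ts",
  "@@ -1,3 +1,3 @@",
  " export function slugify(s: string) {",
  "-  return s.toLowerCase();",
  "+  return s.toLowerCase().trim();",
  " }",
].join("\n");

console.log(changedLines(sample), reviewableSize(sample)); // 2 true
```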

CHECK
An engineer pastes a long file into the AI and asks it to refactor. The output looks reasonable but calls a library method that is clearly invented. Which verification stage is most likely to catch this fastest?

§ · FURTHER READING
References & deeper sources

  1. Stack Overflow (2025). Developer Survey 2025 — AI Adoption, Trust, and Frustration · Stack Overflow Insights
  2. Google AI Agent Clinic (2026). Circuit Breakers for AI Tools — When the Tool Fails, the Agent Loops · Google Developers Blog
  3. VentureBeat (2026). The Sycophancy Trap — Why LLMs Agree Too Much, and What To Do About It · VentureBeat AI
  4. Microsoft Research (2025). Sycophancy in Language Models — Measurement and Mitigation · arXiv
  5. OpenAI (2025). Reducing Sycophancy in GPT-5 · OpenAI Blog
