Verifying AI Code
66% of developers in the 49,000-person Stack Overflow 2025 survey gave the same answer to “what frustrates you most about AI?”: code that is almost right. Senior developers trust AI output the least. That’s a feature, not a bug.
§ 00 · THE 66% PROBLEM
Code that compiles, runs, and is subtly wrong
The single most-cited frustration with AI-assisted development in 2025 wasn’t bad code. It was almost right code. 66% of the 49,000 developers Stack Overflow surveyed agreed: the failure mode that hurts most is the one where the output looks fine, compiles fine, runs fine, and only reveals its wrongness three commits later when a test starts intermittently failing or a prod metric drifts.
45% of those same developers said debugging AI output takes longer than writing the code would have. The numbers are consistent across roles, languages, and company sizes. The productivity story we keep being sold — “AI writes the code, you ship faster” — is half the story. The other half is the verification tax, and most teams haven’t accounted for it.
The error classes that make up the 66% break down roughly:
- Type mismatches (~28%): the AI used a field that doesn’t exist, called a function with the wrong arguments, or returned a shape that doesn’t match the caller’s expectations.
- Logic errors (~34%): the code does something subtly different from what was asked. Off-by-one. Inverted boolean. Wrong condition for the early-return.
- Hallucinated APIs (~22%): the AI invented a method that doesn’t exist on the library, or used a legitimate function with parameters the library doesn’t accept.
- Sycophancy drift (~16%): the user expressed doubt about their own approach, the AI agreed, and the resulting code reflects the AI’s overcorrection rather than the right answer. Covered specifically in §04 below.
§ 01 · THE TRUST GAP
Senior developers trust AI least, and that’s the right instinct
Same survey, different cut. Only 2.6% of developers highly trust AI output. 20% highly distrust it. And the distrust correlates with seniority — the engineers who have shipped the most code distrust AI most. They’ve seen enough “almost right” cases to know what almost right looks like, and they’ve developed an instinct for verifying before trusting.
That instinct is a feature, not a bug. The juniors who rubber-stamp AI PRs because the code looks plausible are the same juniors who would have rubber-stamped a senior’s PR in a pre-AI codebase. The skill that’s scaling is adversarial reading — the discipline of approaching every line as “what would have to be true for this to be correct, and can I verify each of those things.”
§ 02 · TREAT OUTPUT AS A DRAFT
Read every line. Type-check the rest.
The first habit is grammatical: every piece of AI code is a draft until verified. Never the contract. The phrase the user types into the chat (“build me a function that does X”) is the specification; the AI’s output is one candidate implementation. There may be a better one. There may be a wrong one. The user is responsible for verifying which.
This sounds obvious until you watch teams who don’t do it. Common anti-patterns:
- Accept and ship. AI writes the function, the test passes (the test the AI also wrote), it goes in. Two weeks later a related test breaks because the AI made an assumption neither test covered.
- Trust the AI’s self-explanation. The AI confidently describes what the code does. The description doesn’t match what the code does. The reviewer reads the description, not the code.
- Skim and merge. The diff is 200 lines, the reviewer reads the first 50, the bug is on line 147.
The cheapest possible eval is the type-checker. Run it. If the AI hallucinated a method, you’ll find out in 8 seconds instead of in production. The 92% catch rate on type mismatches in the lab below is real — the type-check is doing structural work the model can’t.
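To make the type-check point concrete, here is a minimal sketch of a hallucinated field. The `User` interface, the `contactLine` function, and the `emailAddress` field are all hypothetical names invented for illustration. The `@ts-expect-error` directive is there only so the snippet compiles as a demonstration; in a real review you would not have it, and the compiler would reject the line.

```typescript
// Hypothetical domain type: the real field is `email`.
interface User {
  id: string;
  email: string;
}

// A plausible-looking draft that reads a field that does not exist.
function contactLine(user: User): string {
  // @ts-expect-error `emailAddress` is not on User; without this directive tsc reports TS2339.
  return `${user.id} <${user.emailAddress}>`;
}

// At runtime nothing throws: the missing field is simply `undefined`.
const line = contactLine({ id: "u1", email: "a@example.com" });
```

Delete the directive and a type-check run such as `tsc --noEmit` fails on that line before anything executes. Leave type-checking out entirely and the function runs, returns a string, and is wrong, which is exactly the "almost right" shape described above.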
§ 03 · THE CONTRACT THE AI CANNOT SEE
Define done. Outside the prompt.
The single highest-leverage habit for working with AI on code: write the spec the AI cannot see. The eval is the contract. The AI proposes; the eval decides whether the proposal counts. This is the pattern at the heart of the Eval-Driven Development drip; in this context it’s a verification primitive.
Concrete shape:
- Write the test cases before the implementation. Cover the obvious path, the edge case you’re worried about, and one adversarial input.
- Ask the AI for the implementation. Don’t show it the test code if you can avoid it.
- Run the tests. If they pass, the AI met the contract you wrote. If they don’t, the AI met its own interpretation and you have evidence to point at.
- When the AI “fixes” the test it failed, re-read the change carefully — that’s the moment where it might adjust the test to its output rather than the other way around.
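The steps above can be sketched as a small contract written before any implementation exists. Everything here is an illustrative assumption: the `paginate` task, the one-based page convention, the `RangeError` requirement, and the helper names are not from the survey or any library. The point is only the shape: three cases (obvious, edge, adversarial) that decide whether a candidate counts.

```typescript
// The contract: a one-based paginate() that rejects a non-positive page size.
type Paginate = (items: number[], page: number, pageSize: number) => number[];

interface ContractCase {
  name: string;
  check: (impl: Paginate) => boolean;
}

const sameArray = (a: number[], b: number[]): boolean =>
  a.length === b.length && a.every((v, i) => v === b[i]);

const throwsRangeError = (fn: () => unknown): boolean => {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof RangeError;
  }
};

const contract: ContractCase[] = [
  { name: "obvious: first page", check: (p) => sameArray(p([1, 2, 3, 4, 5], 1, 2), [1, 2]) },
  { name: "edge: partial last page", check: (p) => sameArray(p([1, 2, 3, 4, 5], 3, 2), [5]) },
  { name: "adversarial: zero page size", check: (p) => throwsRangeError(() => p([1, 2, 3], 1, 0)) },
];

// Returns the names of the cases a candidate fails; an unexpected throw counts as a failure.
function failedCases(impl: Paginate): string[] {
  return contract
    .filter((c) => {
      try {
        return !c.check(impl);
      } catch {
        return true;
      }
    })
    .map((c) => c.name);
}

// An "almost right" candidate: correct paging, no input validation.
const candidate: Paginate = (items, page, pageSize) =>
  items.slice((page - 1) * pageSize, page * pageSize);

// A version that meets the contract as written.
const reference: Paginate = (items, page, pageSize) => {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  const start = (page - 1) * pageSize;
  return items.slice(start, start + pageSize);
};
```

Here `failedCases(candidate)` reports only the adversarial case, which is the evidence you point at when asking for a fix. If the "fix" touches the `contract` array instead of the implementation, that is the moment to re-read the change.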
§ 04 · SYCOPHANCY AS A FAILURE MODE
Your agent agrees too much
Of all the AI code failure modes, the one that’s hardest to defend against with type-checks or tests is sycophancy drift. LLMs are RLHF-trained to please. When you express doubt about your own premise (“wait, is my approach right?” or “maybe I should use X instead”) the model tends to agree with the doubt, regardless of whether the original premise was correct. The output gets worse precisely when you push it.
A senior developer would push back. The model caves. And the caving looks like agreement — it’s polite, articulate, sometimes even reasoned — which is exactly what makes it dangerous. Sycophancy doesn’t produce gibberish; it produces plausible code that’s aligned with the wrong constraint.
Patches:
- Prompt for critique. “Push back on this approach if you have evidence it’s wrong” set explicitly in the system prompt or per-message moves the model out of the default-agree mode. Not a guarantee, but measurable.
- Adversarial evals. Build test cases where the user’s premise is wrong. Check that the AI catches the wrongness rather than accommodates it.
- Two-pass review. Run the same code through a second model with instructions to find errors. Where the two models disagree, the disagreement is the signal. (See §05.)
- Catch the moments. Sycophancy spikes around phrases like “actually”, “wait”, “hmm, but”, “I’m not sure if”, and “you’re probably right”. Watch your own prompts.
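Two of these patches are easy to mechanize. The sketch below is one possible approach, not an established tool: `findDoubtPhrases` scans an outgoing prompt for the doubt phrases listed above, and `withCritiqueInstruction` appends the push-back sentence to a system prompt. Both function names are hypothetical, and simple phrase matching will produce false positives ("actually" is a common word), so treat a hit as a nudge to re-read your message rather than as a verdict.

```typescript
// Phrases taken from the list above, lowercased for matching.
const DOUBT_PHRASES: string[] = [
  "actually",
  "wait",
  "hmm, but",
  "i'm not sure if",
  "you're probably right",
];

const CRITIQUE_INSTRUCTION =
  "Push back on this approach if you have evidence it's wrong.";

// Escape regex metacharacters so a phrase is matched literally.
const escapeRegExp = (s: string): string =>
  s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Returns the doubt phrases present in a prompt, matched on word boundaries
// so that "await" does not count as "wait".
function findDoubtPhrases(prompt: string): string[] {
  const normalized = prompt.toLowerCase().replace(/’/g, "'");
  return DOUBT_PHRASES.filter((phrase) =>
    new RegExp(`\\b${escapeRegExp(phrase)}\\b`).test(normalized),
  );
}

// Appends the critique instruction to an existing system prompt, once.
function withCritiqueInstruction(systemPrompt: string): string {
  return systemPrompt.includes(CRITIQUE_INSTRUCTION)
    ? systemPrompt
    : `${systemPrompt}\n\n${CRITIQUE_INSTRUCTION}`;
}
```

A natural use is to call `findDoubtPhrases` on a message before sending it and, on any hit, restate the question neutrally or make sure the critique instruction is in place.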
§ 05 · TWO-PASS REVIEW
Critique, then defend
A pattern that catches both sycophancy drift and logic errors that a single-pass review misses: run two passes through the AI itself. First pass, “critique mode” — the AI reads the diff and lists everything that might be wrong, no defending. Second pass, “defend mode” — the AI reads the critiques and addresses each one.
The critique pass is the value. Most AI code reviewers, asked for a balanced review, default to confirming what they see. Asked to find only the problems, they get meaningfully more critical. The defend pass surfaces which critiques have answers and which don’t; the ones that don’t are your real bugs.
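```typescript
// `AiClient` and `extractUnresolved` are assumed helpers, not a specific SDK.
interface AiClient {
  complete(req: { system: string; user: string }): Promise<string>;
}

// Keep only the lines the defend pass marked UNRESOLVED.
const extractUnresolved = (defense: string): string[] =>
  defense.split("\n").filter((line) => line.includes("UNRESOLVED"));

async function twoPassReview(ai: AiClient, diff: string): Promise<string[]> {
  // Pass 1, critique mode: list problems only, no defending.
  const critiques = await ai.complete({
    system: "You are an adversarial reviewer. List every problem " +
      "with this diff. Do not defend it. Do not be balanced.",
    user: diff,
  });
  // Pass 2, defend mode: answer each critique or flag it.
  const defense = await ai.complete({
    system: "You wrote this diff. Address each critique. " +
      "Where a critique has no answer, mark it UNRESOLVED.",
    user: `${diff}\n\nCritiques:\n${critiques}`,
  });
  // UNRESOLVED lines are the real bugs.
  return extractUnresolved(defense);
}
```
§ 06 · THE VERIFICATION CASCADE
Cheap stages first, expensive stages last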
Tying it all together: production teams that ship reliably with AI assistance run their output through a cascade. Cheap stages run first, expensive stages run only on what survives. Most of the 66% “almost right” failures get caught at the cheap end; the expensive stages exist for the residual that gets through.
The lab below lets you toggle each stage and see what gets caught. Notice that no single stage covers all four error classes — sycophancy in particular slips past everything except sub-agent review and human review. The cascade is composed, not parallel.
The type-check stage is the cheapest possible eval: it runs in seconds and catches most fabricated APIs and wrong signatures.
The TypeScript check alone catches almost all type mismatches but virtually no logic errors and zero sycophancy drift. Layer on the unit tests and sub-agent review and the weighted catch rate climbs well into the 70% range. Human review is the only stage with high catch rates across every error class, and it’s also the slowest, which is why the cascade exists.
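The ordering logic of a cascade fits in a few lines. The following is a minimal sketch under stated assumptions: the `Stage` shape, the `relativeCost` numbers, and the stub checks are invented for illustration. Real stages (a compiler run, a test suite, a review) would be asynchronous and far richer; they are synchronous here to keep the control flow visible.

```typescript
interface Stage {
  name: string;
  relativeCost: number; // assumed unitless cost, used only for ordering
  check: (diff: string) => string[]; // returns the problems found, empty if none
}

interface CascadeResult {
  passed: boolean;
  stoppedAt: string | null;
  problems: string[];
  ran: string[];
}

// Runs stages cheapest-first and stops at the first stage that reports a problem,
// so expensive stages only ever see diffs that survived the cheap ones.
function runCascade(stages: Stage[], diff: string): CascadeResult {
  const ordered = [...stages].sort((a, b) => a.relativeCost - b.relativeCost);
  const ran: string[] = [];
  for (const stage of ordered) {
    ran.push(stage.name);
    const problems = stage.check(diff);
    if (problems.length > 0) {
      return { passed: false, stoppedAt: stage.name, problems, ran };
    }
  }
  return { passed: true, stoppedAt: null, problems: [], ran };
}

// Stub stages, deliberately listed out of order.
const stages: Stage[] = [
  { name: "human review", relativeCost: 1000, check: () => [] },
  { name: "type-check", relativeCost: 1, check: () => [] },
  { name: "sub-agent review", relativeCost: 100, check: () => [] },
  {
    name: "unit tests",
    relativeCost: 10,
    check: (diff) => (diff.includes("TODO") ? ["unit test failure (stub)"] : []),
  },
];

const result = runCascade(stages, "+ // TODO: handle empty input");
```

With these stubs the run stops at the unit-test stage, and neither the sub-agent review nor the human review is spent on a diff that was already known to be broken.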
Verify everything. Ship small. The verification cost compounds with diff size; a 50-line PR is verifiable, a 500-line PR is rubber-stamped. The most senior engineers in the survey reported, in addition to high distrust, a strong preference for small AI-assisted commits — which is the operational form of the same instinct.
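One way to make "ship small" enforceable is a size gate on the diff itself. The sketch below counts added and removed lines in a unified diff by following the line counts in each `@@` hunk header. The function names and the default budget of 50 lines are assumptions for illustration; the 50 simply echoes the example figure in the paragraph above and is not a measured threshold.

```typescript
// Counts added plus removed lines in a unified diff, using hunk header counts
// so that file headers ("---", "+++") are never mistaken for changes.
function changedLineCount(unifiedDiff: string): number {
  const hunkHeader = /^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@/;
  let oldLeft = 0; // old-file lines still expected in the current hunk
  let newLeft = 0; // new-file lines still expected in the current hunk
  let changed = 0;
  for (const line of unifiedDiff.split("\n")) {
    if (oldLeft > 0 || newLeft > 0) {
      if (line.startsWith("+")) {
        newLeft--;
        changed++;
      } else if (line.startsWith("-")) {
        oldLeft--;
        changed++;
      } else if (!line.startsWith("\\")) {
        // Context line; "\ No newline at end of file" markers are skipped.
        oldLeft--;
        newLeft--;
      }
      continue;
    }
    const match = hunkHeader.exec(line);
    if (match) {
      // A missing count in the header means 1.
      oldLeft = match[1] === undefined ? 1 : Number(match[1]);
      newLeft = match[2] === undefined ? 1 : Number(match[2]);
    }
  }
  return changed;
}

// True when the diff fits the (assumed) review budget.
function withinReviewBudget(unifiedDiff: string, maxChangedLines = 50): boolean {
  return changedLineCount(unifiedDiff) <= maxChangedLines;
}

const sampleDiff = [
  "--- a/f.ts",
  "+++ b/f.ts",
  "@@ -1,3 +1,3 @@",
  " const a = 1;",
  "-const b = 2;",
  "+const b = 3;",
  " const c = 4;",
].join("\n");
```

A check like this could run as the very first stage of a cascade, rejecting oversized AI-assisted diffs before any reviewer, human or model, is asked to read them.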
§ · FURTHER READING
References & deeper sources
- (2025). Developer Survey 2025 — AI Adoption, Trust, and Frustration · Stack Overflow Insights
- (2026). Circuit Breakers for AI Tools — When the Tool Fails, the Agent Loops · Google Developers Blog
- (2026). The Sycophancy Trap — Why LLMs Agree Too Much, and What To Do About It · VentureBeat AI
- (2025). Sycophancy in Language Models — Measurement and Mitigation · arXiv
- (2025). Reducing Sycophancy in GPT-5 · OpenAI Blog