Build Along · Module 39 · 6 min build

Build: An AI Code Verifier

The companion build to Verifying AI Code. The single most useful pattern for AI-generated code, in ~60 lines: never accept the first draft on faith. Generate, verify, feed failures back, repeat.

What you’ll build

  • A mock code model that returns a buggy draft, then a fix.
  • A verifier that runs the candidate against tests in isolation.
  • A loop that hands failures back to the model until green — or escalates.
  • The lesson: the model writes; the verifier decides.

§ 00 · “ALMOST RIGHT” · The expensive failure mode

Surveys keep finding the same thing: developers’ top AI frustration isn’t code that obviously breaks; it’s code that’s almost right. It compiles, it reads fine, and it’s subtly wrong. The fix isn’t a better model; it’s a verify loop that refuses to trust the first draft. A verify loop is a control loop that generates a candidate, runs an automatic verifier (tests, types, lint), and, if the candidate fails, feeds the failure back to the model to try again, up to a bounded number of attempts before escalating to a human.

§ 01 · THE VERIFIER · The trust boundary

The verifier is the half of the loop that decides. It runs the candidate against tests and returns a readable verdict. In a real project this is your pytest run, type checker, and linter — ordered cheapest-first.

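def check(source):
    """Run a candidate against TESTS, a list of (call, expected) pairs."""
    ns = {}  # fresh namespace, so one draft cannot see another's definitions
    try:
        exec(source, ns)
    except Exception as e:
        return False, f"code did not even run: {e}"
    for call, expected in TESTS:
        try:
            got = eval(call, ns)
        except Exception as e:
            # a test that raises is a failure to report, not a verifier crash
            return False, f"{call} raised {type(e).__name__}: {e}"
        if got != expected:
            return False, f"{call} -> {got!r}, expected {expected!r}"
    return True, f"all {len(TESTS)} tests passed"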
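The "cheapest-first" ordering mentioned above can be sketched as a list of stages that stops at the first failure. This is a minimal illustration, not the module's own code: the stage functions, `STAGES`, `check_staged`, and the one-entry `TESTS` table are all hypothetical names, and a syntax parse stands in for the linter and type checker a real project would run.

```python
import ast

# Hypothetical test table: (call expression, expected value) pairs.
TESTS = [("add_numbers(2, 3)", 5)]

def check_syntax(source):
    """Cheapest stage: parse the draft without running any of it."""
    try:
        ast.parse(source)
    except SyntaxError as e:
        return False, f"syntax error: {e.msg} (line {e.lineno})"
    return True, "syntax ok"

def check_tests(source):
    """Costlier stage: execute the draft and evaluate each test call."""
    ns = {}
    try:
        exec(source, ns)
        for call, expected in TESTS:
            got = eval(call, ns)
            if got != expected:
                return False, f"{call} -> {got!r}, expected {expected!r}"
    except Exception as e:
        return False, f"raised {type(e).__name__}: {e}"
    return True, f"all {len(TESTS)} tests passed"

# Assumed ordering: cheapest check first, so bad drafts fail fast.
STAGES = [check_syntax, check_tests]

def check_staged(source):
    """Run the stages in order and return the first failing report."""
    for stage in STAGES:
        passed, report = stage(source)
        if not passed:
            return False, report
    return True, report  # report of the last (most expensive) stage
```

Because each stage returns the same `(passed, report)` pair as `check`, a function like this could be dropped into the loop unchanged, and the model would receive the cheapest available explanation of what went wrong.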

§ 02 · THE LOOP · Hand the failure back

def verify_loop():
    feedback = None
    for attempt in range(MAX_ATTEMPTS):
        source = generate(TASK, feedback, attempt)
        passed, report = check(source)
        if passed:
            return source                 # accept
        feedback = report                 # the model gets to see why it failed
    print("Escalate to a human.")         # attempts exhausted
    return None

The key line is feedback = report. A real model reads that failing-test output and corrects course — the same way you would.
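The loop calls `generate(TASK, feedback, attempt)`, which this page does not show. A minimal sketch of what the mock model might look like follows, assuming it simply replays scripted drafts by attempt number. The values of `TASK` and `MAX_ATTEMPTS`, the `DRAFTS` list, and the `build_prompt` helper are illustrative assumptions, not the module's actual code.

```python
# All names and values below are assumptions made for illustration.
TASK = "Write add_numbers(a, b) that returns the sum of a and b."
MAX_ATTEMPTS = 3

BUGGY = "def add_numbers(a, b):\n    return str(a) + str(b)\n"
FIXED = "def add_numbers(a, b):\n    return a + b\n"
DRAFTS = [BUGGY, FIXED]  # scripted: a buggy first draft, then the fix

def build_prompt(task, feedback):
    """Compose what a real model would be sent: the task plus the last failure."""
    if feedback is None:
        return task
    return (
        f"{task}\n\n"
        f"Your previous attempt failed verification:\n{feedback}\n"
        "Return a corrected version."
    )

def generate(task, feedback, attempt):
    """Mock model: ignores the prompt text and replays DRAFTS in order."""
    prompt = build_prompt(task, feedback)  # a real client would send this
    # Clamp the index so extra attempts keep returning the last draft.
    return DRAFTS[min(attempt, len(DRAFTS) - 1)]
```

Swapping the mock for a real model would only change the body of `generate`: the prompt built from the task and the verifier's report is what gives the model a chance to correct course.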

§ 03 · RUN IT · Rejected, then accepted

$ python verify_loop.py
--- Attempt 1 ---
def add_numbers(a, b):
    return str(a) + str(b)
verifier: add_numbers(2, 3) -> '23', expected 5

--- Attempt 2 ---
def add_numbers(a, b):
    return a + b
verifier: all 3 tests passed

Accepted after 2 attempt(s).

No human ever looked at the broken draft. That’s the win: a machine caught the “almost right” code, described the failure precisely, and the loop converged — with a bounded escape hatch to a human if it doesn’t.
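Running unattended also means the verifier has to survive drafts that hang or crash, which an in-process `exec` cannot guarantee. One possible hardening, sketched here under assumptions, is to run each test call in a fresh interpreter with a timeout. The `run_isolated` name and its return convention are hypothetical, and a child process is only process-level isolation, not a security sandbox.

```python
import subprocess
import sys

def run_isolated(source, call, timeout=5.0):
    """Run `source` in a fresh interpreter and print repr() of `call`.

    Returns (True, repr_of_result) on success, or (False, reason) if the
    child process times out or exits with an error.
    """
    script = f"{source}\nprint(repr({call}))\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return False, f"{call} timed out after {timeout}s"
    if proc.returncode != 0:
        # The last stderr line is usually the exception type and message.
        lines = proc.stderr.strip().splitlines()
        return False, lines[-1] if lines else f"exit code {proc.returncode}"
    out = proc.stdout.strip().splitlines()
    # Take the last line in case the draft printed something of its own.
    return True, out[-1] if out else ""
```

A caller would compare the returned text with `repr(expected)`, for example `run_isolated(src, "add_numbers(2, 3)") == (True, repr(5))`. An infinite loop in the draft then becomes an ordinary failure report that can be fed back, rather than a stalled loop.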

CHECK · What makes the verify loop trustworthy enough to run without a human watching every attempt?

§ · FURTHER READING · References & deeper sources

  1. Chen et al. (2022). CodeT: Code Generation with Generated Tests · arXiv
  2. Madaan et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback · NeurIPS
  3. Olausson et al. (2024). Is Self-Repair a Silver Bullet for Code Generation? · ICLR
