Build Along · Module 37 · 6 min build

Build: An Eval Harness

The companion build to Eval-Driven Development. The smallest honest eval suite: a dataset of cases, a system under test, scorers, and a pass/fail gate that returns a non-zero exit code so it runs in CI on every prompt change.

What you’ll build

  • A dataset.jsonl of labeled cases.
  • A system under test you can swap for a real model call.
  • Scorers — exact match, substring, and where LLM-as-judge fits.
  • A CI gate: print a pass-rate table, exit non-zero below threshold.

§ 00 · PROMPTS ARE CODE · So test them like code

A prompt change is a code change with no type checker and no test suite, unless you build one. An eval is that test suite: a dataset of inputs with expected outputs, plus scorers that decide pass/fail, run automatically against a model or prompt. It is the unit test of LLM applications. The discipline is simple: change the prompt, run the eval, watch the number.

§ 01 · A DATASET OF CASES · One labeled example per line

{"input": "The shipping was fast but the product broke in a week.", "expected": "negative"}
{"input": "Absolutely love it, best purchase this year!", "expected": "positive"}
{"input": "It's fine. Does the job, nothing special.", "expected": "neutral"}
{"input": "Not bad at all, honestly impressed.", "expected": "positive"}

That last line is the interesting one. Keyword matching reads “bad” and calls it negative: a classic almost-right failure that an eval catches and a spot-check misses.
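To make that failure concrete, here is a minimal sketch of a keyword-based system under test run against the four cases above. The keyword lists and the names `load_cases` and `keyword_classifier` are hypothetical, chosen only for illustration; a real system under test would be a model call with the same text-in, label-out shape.

```python
import json

# The four cases from the dataset above, inlined so the sketch is self-contained.
DATASET_JSONL = """\
{"input": "The shipping was fast but the product broke in a week.", "expected": "negative"}
{"input": "Absolutely love it, best purchase this year!", "expected": "positive"}
{"input": "It's fine. Does the job, nothing special.", "expected": "neutral"}
{"input": "Not bad at all, honestly impressed.", "expected": "positive"}
"""

def load_cases(text):
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical keyword lists (an assumption, not part of the original build).
NEGATIVE_WORDS = ("broke", "bad", "terrible")
POSITIVE_WORDS = ("love", "best", "impressed")

def keyword_classifier(text):
    """Naive system under test: negative keywords win, then positive, else neutral."""
    lowered = text.lower()
    if any(word in lowered for word in NEGATIVE_WORDS):
        return "negative"
    if any(word in lowered for word in POSITIVE_WORDS):
        return "positive"
    return "neutral"

cases = load_cases(DATASET_JSONL)
failures = [c for c in cases if keyword_classifier(c["input"]) != c["expected"]]
for case in failures:
    # Columns: expected label, predicted label, input text.
    print("FAIL", case["expected"], keyword_classifier(case["input"]), case["input"])
```

Three of the four cases pass, which is exactly why a quick spot-check would look fine. Only the "Not bad at all" case is flagged, because the substring "bad" triggers the negative branch before "impressed" is ever considered.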

§ 02 · SCORERS · How you decide an output is right

def exact_match(prediction, expected):
    return prediction.strip().lower() == expected.strip().lower()

def includes(prediction, expected):
    # pass if the expected answer appears anywhere in the prediction
    return expected.strip().lower() in prediction.strip().lower()

# LLM-as-judge lives here too — same signature, body calls a model:
#   "Does PREDICTION satisfy EXPECTED? yes/no"

Different tasks need different scorers: exact match for classification, substring or regex for extraction, LLM-as-judge for open-ended text. They all share one signature, so the runner doesn’t care which you use.
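The shared signature can be sketched as follows. The `regex_match` and `make_judge` helpers are assumptions added for illustration, and `fake_model` is a stand-in that always answers yes so the example runs offline; in practice you would pass in whatever function calls your model and returns its text.

```python
import re

def exact_match(prediction, expected):
    return prediction.strip().lower() == expected.strip().lower()

def includes(prediction, expected):
    return expected.strip().lower() in prediction.strip().lower()

def regex_match(prediction, expected):
    # Here `expected` is treated as a regular expression pattern.
    return re.search(expected, prediction, flags=re.IGNORECASE) is not None

def make_judge(ask_model):
    """Wrap a text-in, text-out callable as a scorer with the shared signature."""
    def judge(prediction, expected):
        prompt = (
            "Does PREDICTION satisfy EXPECTED? Answer yes or no.\n"
            f"PREDICTION: {prediction}\nEXPECTED: {expected}"
        )
        # Anything that does not start with "yes" counts as a fail.
        return ask_model(prompt).strip().lower().startswith("yes")
    return judge

def fake_model(prompt):
    # Hypothetical stand-in for a real model call; always answers yes.
    return "Yes, it does."

# Every scorer takes (prediction, expected) and returns a bool,
# so the runner can loop over them without special cases.
scorers = {
    "exact": exact_match,
    "includes": includes,
    "regex": regex_match,
    "judge": make_judge(fake_model),
}

def score_all(prediction, expected):
    return {name: scorer(prediction, expected) for name, scorer in scorers.items()}

results = score_all("Sentiment: Positive", "positive")
print(results)
```

For a prediction like "Sentiment: Positive", exact match fails while the substring and regex scorers pass, which illustrates why the choice of scorer has to follow the task rather than the other way around.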

§ 03 · THE GATE · A number CI can act on

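import sys

THRESHOLD = 0.90                # e.g. 0.90: minimum pass rate to ship

# `passed` and `cases` come from the scoring loop that runs before this point
rate = passed / len(cases)
print(f"Pass rate: {passed}/{len(cases)} = {rate:.0%}  (threshold {THRESHOLD:.0%})")
if rate < THRESHOLD:
    sys.exit(1)                 # fail the build

$ python run.py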
FAIL    positive   negative   Not bad at all, honestly impressed
Pass rate: 6/7 = 86%  (threshold 90%)
RESULT: FAIL — below threshold, build should not ship.
$ echo $?
1

Now improve the system (or point it at a real model) until the gate goes green. That non-zero exit code is the whole point: the eval is a gate a machine enforces, not a chart a human glances at.
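Putting the pieces together, a runner along these lines would produce that exit code. This is a sketch under stated assumptions, not the module's actual run.py: the function name `run_eval`, the `always_positive` placeholder system, and the choice to treat an empty dataset as a failure are all illustrative.

```python
THRESHOLD = 0.90  # assumed value, taken from the "e.g. 0.90" comment in the gate

CASES = [
    {"input": "The shipping was fast but the product broke in a week.", "expected": "negative"},
    {"input": "Absolutely love it, best purchase this year!", "expected": "positive"},
    {"input": "It's fine. Does the job, nothing special.", "expected": "neutral"},
    {"input": "Not bad at all, honestly impressed.", "expected": "positive"},
]

def exact_match(prediction, expected):
    return prediction.strip().lower() == expected.strip().lower()

def always_positive(text):
    # Placeholder system under test; swap in a real model call here.
    return "positive"

def run_eval(cases, system, scorer, threshold=THRESHOLD):
    """Score every case, print failures and the pass rate, return an exit code."""
    if not cases:
        # Assumption: an empty dataset proves nothing, so treat it as a failure.
        print("No cases to run.")
        return 1
    passed = 0
    for case in cases:
        prediction = system(case["input"])
        if scorer(prediction, case["expected"]):
            passed += 1
        else:
            print(f"FAIL    {case['expected']}   {prediction}   {case['input']}")
    rate = passed / len(cases)
    print(f"Pass rate: {passed}/{len(cases)} = {rate:.0%}  (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

exit_code = run_eval(CASES, always_positive, exact_match)
print("exit code:", exit_code)
```

In a real run.py the last step would hand `exit_code` to `sys.exit`, so that the shell and the CI job see the failure. Returning the code from a function rather than exiting inside it keeps the runner easy to call from tests.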

CHECK · Why return a non-zero exit code from the eval runner instead of just printing the pass rate?

§ · FURTHER READING · References & deeper sources

  1. OpenAI (2023). Evals — a framework for evaluating LLMs · GitHub
  2. Anthropic (2025). Building evals and test cases · Anthropic Docs
  3. Hugging Face (2024). lighteval — fast LLM evaluation · GitHub
