Build: An Eval Harness
The companion build to Eval-Driven Development. The smallest honest eval suite: a dataset of cases, a system under test, scorers, and a pass/fail gate that returns a non-zero exit code so it runs in CI on every prompt change.
What you’ll build
- A dataset.jsonl of labeled cases.
- A system under test you can swap for a real model call.
- Scorers — exact match, substring, and where LLM-as-judge fits.
- A CI gate: print a pass-rate table, exit non-zero below threshold.
§ 00 · PROMPTS ARE CODE
So test them like code
A prompt change is a code change with no type checker and no test suite, unless you build one. An eval is that test suite: a dataset of inputs with expected outputs, plus scorers that decide pass/fail, run automatically against a model or prompt. It is the unit test of LLM applications. The discipline is simple: change the prompt, run the eval, watch the number.
§ 01 · A DATASET OF CASES
One labeled example per line
{"input": "The shipping was fast but the product broke in a week.", "expected": "negative"}
{"input": "Absolutely love it, best purchase this year!", "expected": "positive"}
{"input": "It's fine. Does the job, nothing special.", "expected": "neutral"}
{"input": "Not bad at all, honestly impressed.", "expected": "positive"}

That last line is the interesting one. Keyword matching reads “bad” and calls it negative: a classic almost-right failure that an eval catches and a spot-check misses.
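To make that failure concrete, here is a minimal sketch of a keyword-based system under test run against the four cases above. The names `load_cases` and `keyword_classifier` and the two keyword lists are illustrative assumptions, not part of the original build. The cases are parsed from an in-memory string so the snippet runs without a file.

```python
import json

# The four cases shown above, inlined so the sketch runs without a file.
DATASET = """\
{"input": "The shipping was fast but the product broke in a week.", "expected": "negative"}
{"input": "Absolutely love it, best purchase this year!", "expected": "positive"}
{"input": "It's fine. Does the job, nothing special.", "expected": "neutral"}
{"input": "Not bad at all, honestly impressed.", "expected": "positive"}
"""

def load_cases(text):
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical keyword lists, chosen only to illustrate the failure mode.
NEGATIVE_WORDS = ("bad", "broke", "terrible")
POSITIVE_WORDS = ("love", "best", "great")

def keyword_classifier(text):
    """Naive system under test: negative keywords are checked first."""
    lowered = text.lower()
    if any(word in lowered for word in NEGATIVE_WORDS):
        return "negative"
    if any(word in lowered for word in POSITIVE_WORDS):
        return "positive"
    return "neutral"

cases = load_cases(DATASET)
failures = [c for c in cases if keyword_classifier(c["input"]) != c["expected"]]
for case in failures:
    # Columns: status, expected label, predicted label, input text
    print("FAIL", case["expected"], keyword_classifier(case["input"]), case["input"])
```

Three of the four cases pass, which is exactly why a quick spot-check would look fine. Only the "Not bad" case is reported as a failure.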
§ 02 · SCORERS
How you decide an output is right
def exact_match(prediction, expected):
    return prediction.strip().lower() == expected.strip().lower()

def includes(prediction, expected):
    # pass if the expected answer appears anywhere in the prediction
    return expected.strip().lower() in prediction.strip().lower()

# LLM-as-judge lives here too — same signature, body calls a model:
# "Does PREDICTION satisfy EXPECTED? yes/no"

Different tasks need different scorers: exact match for classification, substring or regex for extraction, LLM-as-judge for open-ended text. They all share one signature, so the runner doesn’t care which you use.
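The two scorers mentioned but not shown, regex and LLM-as-judge, can be sketched with the same `(prediction, expected)` signature. This is one possible shape, not the original build's code: `regex_match`, `make_llm_judge`, and the `ask_model` callable are hypothetical names. Here `ask_model` stands for any function that takes a prompt string and returns the model's text reply, so no specific provider API is assumed.

```python
import re

def regex_match(prediction, expected):
    """Pass if the regex pattern in `expected` matches anywhere in the prediction."""
    return re.search(expected, prediction, flags=re.IGNORECASE) is not None

def make_llm_judge(ask_model):
    """Build a scorer from any callable that maps a prompt string to reply text."""
    def llm_judge(prediction, expected):
        prompt = (
            "Does PREDICTION satisfy EXPECTED? Answer yes or no.\n"
            f"PREDICTION: {prediction}\n"
            f"EXPECTED: {expected}"
        )
        answer = ask_model(prompt)
        # Treat anything that does not start with "yes" as a failure.
        return answer.strip().lower().startswith("yes")
    return llm_judge

# Stand-in for a real model call so the sketch runs offline (assumption).
judge = make_llm_judge(lambda prompt: "Yes.")

print(regex_match("Order #12345 shipped", r"#\d+"))
print(judge("You can return it within 30 days.", "mentions the return window"))
```

Passing the model call in as a parameter keeps the judge testable: a canned reply exercises the yes/no parsing without network access, and a real client can be swapped in later.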
§ 03 · THE GATE
A number CI can act on
import sys

# `passed` and `cases` are computed earlier in run.py
rate = passed / len(cases)
print(f"Pass rate: {passed}/{len(cases)} = {rate:.0%}")
if rate < THRESHOLD:  # e.g. 0.90
    sys.exit(1)  # fail the build

$ python run.py
FAIL positive negative Not bad at all, honestly impressed
Pass rate: 6/7 = 86% (threshold 90%)
RESULT: FAIL — below threshold, build should not ship.
$ echo $?
1
Now improve the system (or point it at a real model) until the gate goes green. That non-zero exit code is the whole point: the eval is a gate a machine enforces, not a chart a human glances at.
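For reference, the pieces can be tied together in a runner like the sketch below. It is a minimal illustration under stated assumptions: `run_eval`, `stub_system`, and `demo_cases` are hypothetical names, the stub stands in for a real model call, and the treatment of an empty dataset as a failure is a design choice made here, not something the original specifies.

```python
THRESHOLD = 0.90  # same example threshold as in the gate above

def exact_match(prediction, expected):
    return prediction.strip().lower() == expected.strip().lower()

def stub_system(text):
    """Placeholder system under test; swap in a real model call here."""
    return "negative" if "bad" in text.lower() else "positive"

def run_eval(cases, system, scorer, threshold=THRESHOLD):
    """Score every case, print failures and the pass rate, return an exit code."""
    if not cases:
        # Assumption: an empty dataset should fail rather than pass silently.
        print("No cases to run.")
        return 1
    passed = 0
    for case in cases:
        prediction = system(case["input"])
        if scorer(prediction, case["expected"]):
            passed += 1
        else:
            # Columns: status, expected label, predicted label, input text
            print("FAIL", case["expected"], prediction, case["input"])
    rate = passed / len(cases)
    print(f"Pass rate: {passed}/{len(cases)} = {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

demo_cases = [
    {"input": "Absolutely love it, best purchase this year!", "expected": "positive"},
    {"input": "Not bad at all, honestly impressed.", "expected": "positive"},
]
exit_code = run_eval(demo_cases, stub_system, exact_match)
print("exit code:", exit_code)
```

In a real run.py the last step would be `sys.exit(exit_code)`, so the CI job fails whenever the pass rate drops below the threshold. Returning the code from `run_eval` instead of exiting inside it keeps the function easy to test.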
§ · FURTHER READING
References & deeper sources
- (2023). Evals — a framework for evaluating LLMs · GitHub
- (2025). Building evals and test cases · Anthropic Docs
- (2024). lighteval — fast LLM evaluation · GitHub