Step 7: Evaluate — Train a Reasoner with GRPO

Pass@1, the honest way

A rising reward curve is necessary but not sufficient: you could be overfitting the train split. The real test is to greedy-decode one answer per held-out question and check it with the verifier. That's pass@1. Run it twice, once on the base model (adapters disabled) and once on the trained model (adapters enabled), on the test split the model never saw.

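# eval.py
from vllm import SamplingParams

from model import model, tokenizer
from data import build_dataset
from rewards import extract_answer

def pass_at_1(use_lora: bool, n: int = 200) -> float:
    """Greedy pass@1 on the first n questions of the held-out test split."""
    ds = build_dataset("test").select(range(n))
    # Load the adapter once, not once per question; None runs the base model.
    lora = model.load_lora("grpo-reasoner/lora") if use_lora else None
    # Greedy decoding: temperature 0, one completion per prompt.
    params = SamplingParams(temperature=0.0, max_tokens=1024)
    correct = 0
    for ex in ds:
        prompt = tokenizer.apply_chat_template(
            ex["prompt"], add_generation_prompt=True, tokenize=False
        )
        out = model.fast_generate(
            [prompt],
            sampling_params=params,
            lora_request=lora,
        )[0].outputs[0].text
        # Same extractor the reward function used during training.
        if extract_answer(out) == ex["answer"]:
            correct += 1
    # Divide by the number of questions actually scored.
    return correct / len(ds)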
 
if __name__ == "__main__":
    base = pass_at_1(use_lora=False)
    tuned = pass_at_1(use_lora=True)
    print(f"base  pass@1: {base:.1%}")
    print(f"tuned pass@1: {tuned:.1%}")
    # The + flag prints the sign either way, so a regression shows as -x.x.
    print(f"lift: {(tuned - base) * 100:+.1f} pts")

$ python eval.py
base  pass@1: 41.5%
tuned pass@1: 58.0%
lift: +16.5 pts

Exact numbers depend on the base model and step count, but a clear double-digit lift on a 3–4B model is typical. It's measured with the same verifier used in training, now applied to unseen problems. Toggle use_lora and you're comparing the identical model with and without the RLVR adapters; nothing else changes.
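With only 200 questions, it's worth checking how much of that lift could be sampling noise. The sketch below is an addition, not part of eval.py: lift_interval is a hypothetical helper that uses the normal approximation to the binomial and treats the two runs as independent, which is a simplification since both runs answer the same questions.

```python
import math

def lift_interval(base_correct: int, tuned_correct: int, n: int,
                  z: float = 1.96) -> tuple[float, float, float]:
    """Return (lift, low, high) for tuned minus base pass@1, as fractions."""
    p_base = base_correct / n
    p_tuned = tuned_correct / n
    # Binomial standard error of each pass rate, combined as if independent.
    se = math.sqrt(p_base * (1 - p_base) / n + p_tuned * (1 - p_tuned) / n)
    lift = p_tuned - p_base
    return lift, lift - z * se, lift + z * se

# 41.5% and 58.0% of 200 questions correspond to 83 and 116 correct answers.
lift, low, high = lift_interval(83, 116, 200)
print(f"lift: {lift * 100:+.1f} pts (95% CI {low * 100:+.1f} to {high * 100:+.1f})")
```

For the run above, the interval spans roughly +7 to +26 points: it excludes zero, but it's wide. Raising n is the simplest way to tighten it.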

Read a few completions

Numbers hide the interesting part. Print a base vs tuned answer to the same question:

# base: jumps to a number, often wrong
# <answer>18</answer>
 
# tuned: lays out the steps, lands it
# <reasoning>Natalia sold 48 in April. In May she sold half: 48/2 = 24.
# Total = 48 + 24 = 72.</reasoning>
# <answer>72</answer>

The tuned model reasons because reasoning is what earned reward. That behavioral change — not just the accuracy delta — is the thing to internalize.
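To decide which completions to read, one option is to pull out the questions where base and tuned disagree on correctness. This is a minimal sketch under assumptions: the row keys (question, gold, base_answer, tuned_answer) and the flipped helper are hypothetical, and the answers are assumed to be already extracted with the training extractor.

```python
from typing import Iterable, Iterator

def flipped(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows where exactly one of base/tuned matches the gold answer."""
    for row in rows:
        base_ok = row["base_answer"] == row["gold"]
        tuned_ok = row["tuned_answer"] == row["gold"]
        if base_ok != tuned_ok:
            # "fixed": tuned got it right where base failed; "broken": the reverse.
            yield {**row, "direction": "fixed" if tuned_ok else "broken"}

# Toy rows for illustration only; real rows would come from the eval loop.
rows = [
    {"question": "q1", "gold": "72", "base_answer": "18", "tuned_answer": "72"},
    {"question": "q2", "gold": "5", "base_answer": "5", "tuned_answer": "5"},
    {"question": "q3", "gold": "10", "base_answer": "10", "tuned_answer": None},
]
for row in flipped(rows):
    print(row["direction"], row["question"])
# prints:
# fixed q1
# broken q3
```

The "broken" rows deserve as much attention as the "fixed" ones: they show what the adapters cost, which the aggregate lift averages away.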

Guard against fooling yourself

Eval on test, never train. We select from the test split above on purpose.
Use the same extractor as training. If eval parses answers differently than the reward did, you're measuring the parser, not the model.
pass@1 at temperature 0. Sampling adds run-to-run noise, and keeping the best of several samples is pass@k, not pass@1; greedy is the honest, reproducible single-shot number.
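The "same extractor" rule is easiest to keep when the reward and the eval both call one function. The real extract_answer in rewards.py isn't shown here, so the sketch below uses a hypothetical stand-in, extract_answer_sketch, that assumes the <answer>...</answer> format seen in the completions above; correctness_reward and is_correct are likewise illustrative names.

```python
import re
from typing import Optional

# Captures the body of each <answer>...</answer> pair; DOTALL allows newlines inside.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer_sketch(text: str) -> Optional[str]:
    """Return the stripped body of the last <answer> block, or None if absent."""
    matches = _ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None

def correctness_reward(completion: str, gold: str) -> float:
    """Training side: 1.0 when the extracted answer equals the gold answer."""
    return 1.0 if extract_answer_sketch(completion) == gold else 0.0

def is_correct(completion: str, gold: str) -> bool:
    """Eval side: calls the identical extractor, so parsing cannot drift."""
    return extract_answer_sketch(completion) == gold

sample = "<reasoning>48 + 24 = 72.</reasoning>\n<answer> 72 </answer>"
print(extract_answer_sketch(sample), correctness_reward(sample, "72"), is_correct(sample, "72"))
```

If the eval ever needs a more lenient parser, change the shared function and re-check the training reward against it instead of forking a second copy.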

Reference: GSM8K · DeepSeek-R1 (pass@1 via RLVR) · Eval-Driven Development (drip)