Pass@1, the honest way

Reward curves going up is necessary but not sufficient — you could be overfitting the train split. The real test: greedy-decode one answer per held-out question and check it with the verifier. That's pass@1. Run it twice — base model (adapters disabled) and trained model (adapters enabled) — on the test split the model never saw.

# eval.py
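from vllm import SamplingParams

from model import model, tokenizer
from data import build_dataset
from rewards import extract_answer

# Greedy decoding: temperature 0 yields one deterministic answer per question.
# fast_generate is vLLM-backed, so it takes a SamplingParams object, not a dict.
GREEDY = SamplingParams(temperature=0.0, max_tokens=1024)

def pass_at_1(use_lora: bool, n: int = 200) -> float:
    ds = build_dataset("test").select(range(n))
    # Load the adapters once instead of on every iteration; None runs the base model.
    lora = model.load_lora("grpo-reasoner/lora") if use_lora else None
    correct = 0
    for ex in ds:
        prompt = tokenizer.apply_chat_template(
            ex["prompt"], add_generation_prompt=True, tokenize=False
        )
        out = model.fast_generate(
            [prompt], sampling_params=GREEDY, lora_request=lora
        )[0].outputs[0].text
        # Same extractor the reward function used during training.
        if extract_answer(out) == ex["answer"]:
            correct += 1
    return correct / len(ds)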
 
if __name__ == "__main__":
    base = pass_at_1(use_lora=False)
    tuned = pass_at_1(use_lora=True)
    print(f"base  pass@1: {base:.1%}")
    print(f"tuned pass@1: {tuned:.1%}")
    print(f"lift: {(tuned - base) * 100:+.1f} pts")

$ python eval.py
base  pass@1: 41.5%
tuned pass@1: 58.0%
lift: +16.5 pts

Exact numbers depend on the base model and step count, but a clear double-digit lift on a 3–4B model is the typical, honest result — and it's measured with the same verifier used in training, now applied to unseen problems. Toggle use_lora and you're comparing the identical model with and without the RLVR adapters; nothing else changes.
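With only 200 questions, each pass@1 figure carries a few points of sampling noise, so it helps to check that the lift clears it. The sketch below is a rough check, not part of the original eval.py: it assumes the sample output above corresponds to 83 and 116 correct answers out of 200, and it uses a simple normal approximation that treats the two runs as independent (both runs score the same questions, so a paired test on per-question results would be tighter). The function name lift_with_interval is hypothetical.

```python
import math

def lift_with_interval(base_correct: int, tuned_correct: int, n: int, z: float = 1.96):
    """Return (lift, low, high) for tuned minus base accuracy.

    Uses the normal approximation for a difference of two proportions,
    treating the runs as independent samples of size n (an assumption).
    """
    p_base = base_correct / n
    p_tuned = tuned_correct / n
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_base * (1 - p_base) / n + p_tuned * (1 - p_tuned) / n)
    lift = p_tuned - p_base
    return lift, lift - z * se, lift + z * se

# Counts implied by the sample output: 41.5% and 58.0% of 200 questions.
lift, low, high = lift_with_interval(83, 116, 200)
print(f"lift: {lift * 100:+.1f} pts, 95% interval: {low * 100:+.1f} to {high * 100:+.1f} pts")
```

For the sample numbers the interval runs from roughly 7 to 26 points, so the lift is real but its exact size is loosely pinned down; a larger n narrows it.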

Read a few completions

Numbers hide the interesting part. Print a base vs tuned answer to the same question:

# base: jumps to a number, often wrong
# <answer>18</answer>
 
# tuned: lays out the steps, lands it
# <reasoning>Natalia sold 48 in April. In May she sold half: 48/2 = 24.
# Total = 48 + 24 = 72.</reasoning>
# <answer>72</answer>

The tuned model reasons because reasoning is what earned reward. That behavioral change — not just the accuracy delta — is the thing to internalize.
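One way to produce a comparison like the one above is a small helper that runs the same question through both models and prints the completions together. This is a minimal sketch: side_by_side, generate_base, and generate_tuned are hypothetical names, and the two callables stand in for whatever wraps the generate call with adapters disabled and enabled.

```python
from typing import Callable

def side_by_side(
    question: str,
    generate_base: Callable[[str], str],
    generate_tuned: Callable[[str], str],
) -> str:
    """Format one question with the base and tuned completions beneath it."""
    sections = [
        ("question", question),
        ("base", generate_base(question)),
        ("tuned", generate_tuned(question)),
    ]
    # One labeled block per section, separated by blank lines.
    return "\n\n".join(f"=== {name} ===\n{text.strip()}" for name, text in sections)

# Stub generators replay the two completions shown above; swap in real model calls.
print(side_by_side(
    "(a held-out test question)",
    lambda q: "<answer>18</answer>",
    lambda q: "<reasoning>48/2 = 24. Total = 48 + 24 = 72.</reasoning>\n<answer>72</answer>",
))
```

Reading five or ten of these pairs is usually enough to see whether the tuned model is actually reasoning or has only learned the output format.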

Guard against fooling yourself

  • Eval on test, never train. We select from the test split above on purpose.
  • Use the same extractor as training. If eval parses answers differently than the reward did, you're measuring the parser, not the model.
  • pass@1 at temperature 0. Sampling adds run-to-run variance, and scoring the best of several samples measures pass@k, which can only be higher; greedy is the honest single-shot number.
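To make the second point concrete, here is a sketch of what a shared extractor might look like. It is not the rewards.extract_answer used above, whose implementation is not shown; it assumes the <answer>...</answer> tag format seen in the sample completions, and the name extract_answer_sketch and the comma-stripping rule are illustrative choices. Whatever the real rules are, training and eval should import the same function so they cannot drift apart.

```python
import re
from typing import Optional

# Assumed format: the final answer sits inside <answer>...</answer> tags.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer_sketch(text: str) -> Optional[str]:
    """Return the contents of the last <answer> block, or None if there is none."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    # Light normalization so " 72 " and "1,000" compare equal to "72" and "1000".
    return matches[-1].strip().replace(",", "")

completion = (
    "<reasoning>48/2 = 24. Total = 48 + 24 = 72.</reasoning>\n"
    "<answer> 72 </answer>"
)
print(extract_answer_sketch(completion))
```

If eval normalized differently, for example by keeping the surrounding whitespace, a correct completion would be scored wrong and the measured lift would reflect the parser.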

Reference: GSM8K · DeepSeek-R1 (pass@1 via RLVR) · Eval-Driven Development (drip)