Pass@1, the honest way
Reward curves going up is necessary but not sufficient — you could be overfitting the train split. The real test: greedy-decode one answer per held-out question and check it with the verifier. That's pass@1. Run it twice — base model (adapters disabled) and trained model (adapters enabled) — on the test split the model never saw.
# eval.py
from vllm import SamplingParams

from model import model, tokenizer
from data import build_dataset
from rewards import extract_answer


def pass_at_1(use_lora: bool, n: int = 200) -> float:
    ds = build_dataset("test").select(range(n))
    # Greedy decoding: temperature 0, one completion per question.
    params = SamplingParams(temperature=0.0, max_tokens=1024)
    # Load the adapters once, not once per question.
    lora = model.load_lora("grpo-reasoner/lora") if use_lora else None
    correct = 0
    for ex in ds:
        prompt = tokenizer.apply_chat_template(
            ex["prompt"], add_generation_prompt=True, tokenize=False
        )
        out = model.fast_generate(
            [prompt],
            sampling_params=params,
            lora_request=lora,
        )[0].outputs[0].text
        # Same extractor the reward function used during training.
        if extract_answer(out) == ex["answer"]:
            correct += 1
    return correct / len(ds)


if __name__ == "__main__":
    base = pass_at_1(use_lora=False)
    tuned = pass_at_1(use_lora=True)
    print(f"base pass@1: {base:.1%}")
    print(f"tuned pass@1: {tuned:.1%}")
    # The "+" format flag prints the sign either way, so a regression shows as "-".
    print(f"lift: {(tuned - base) * 100:+.1f} pts")

$ python eval.py
base pass@1: 41.5%
tuned pass@1: 58.0%
lift: +16.5 pts
Exact numbers depend on the base model and step count, but a clear double-digit lift on a 3–4B model is the typical, honest result, and it's measured with the same verifier used in training, now applied to unseen problems. Toggle use_lora and you're comparing the identical model with and without the RLVR adapters; nothing else changes.
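With only 200 questions, it's worth checking that a lift like this is larger than sampling noise. The sketch below is a rough check, not part of eval.py: it assumes each question is an independent pass/fail trial and treats the two runs as independent, which is an approximation since both runs answer the same questions (a paired test would be more exact). The function names are made up for illustration.

```python
import math


def pass_rate_with_stderr(correct: int, n: int) -> tuple[float, float]:
    """Return (pass@1, binomial standard error) for `correct` hits out of `n`."""
    p = correct / n
    return p, math.sqrt(p * (1 - p) / n)


def lift_in_stderrs(base_correct: int, tuned_correct: int, n: int) -> float:
    """Lift divided by its standard error (assumption: independent runs)."""
    p_base, se_base = pass_rate_with_stderr(base_correct, n)
    p_tuned, se_tuned = pass_rate_with_stderr(tuned_correct, n)
    return (p_tuned - p_base) / math.sqrt(se_base**2 + se_tuned**2)


# 41.5% and 58.0% of 200 questions correspond to 83 and 116 correct answers.
z = lift_in_stderrs(base_correct=83, tuned_correct=116, n=200)
print(f"lift is {z:.1f} standard errors from zero")
```

For the numbers above this comes out at a bit over three standard errors, so the lift is unlikely to be an artifact of which 200 questions were drawn. A lift of two or three points at the same n would not clear that bar.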
Read a few completions
Numbers hide the interesting part. Print a base vs tuned answer to the same question:
# base: jumps to a number, often wrong
# <answer>18</answer>
# tuned: lays out the steps, lands it
# <reasoning>Natalia sold 48 in April. In May she sold half: 48/2 = 24.
# Total = 48 + 24 = 72.</reasoning>
# <answer>72</answer>
The tuned model reasons because reasoning is what earned reward. That behavioral change, not just the accuracy delta, is the thing to internalize.
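One way to produce that kind of comparison is a small helper that takes two generator functions and formats their outputs together. This is a minimal sketch: side_by_side and both generator arguments are hypothetical names, and the stubs below return canned strings so the snippet runs without a model. In practice each generator would wrap the same generate call used in eval.py, once without and once with the adapters.

```python
from typing import Callable


def side_by_side(
    question: str,
    generate_base: Callable[[str], str],
    generate_tuned: Callable[[str], str],
) -> str:
    """Format the base and tuned completions for one question as one report."""
    return "\n".join([
        f"question: {question}",
        "--- base ---",
        generate_base(question).strip(),
        "--- tuned ---",
        generate_tuned(question).strip(),
    ])


# Stub generators stand in for real model calls (assumption for illustration).
# The question text is a placeholder, not a verbatim dataset entry.
report = side_by_side(
    "How many did Natalia sell in April and May altogether?",
    generate_base=lambda q: "<answer>18</answer>",
    generate_tuned=lambda q: (
        "<reasoning>48 + 48/2 = 48 + 24 = 72.</reasoning>\n<answer>72</answer>"
    ),
)
print(report)
```

Reading ten or twenty of these reports, especially the questions where the two models disagree, tells you more about what training changed than the single pass@1 number does.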
Guard against fooling yourself
- Eval on test, never train. We select from the test split above on purpose.
- Use the same extractor as training. If eval parses answers differently than the reward did, you're measuring the parser, not the model.
- pass@1 at temperature 0. Sampling makes the score vary from run to run, and reporting the best of several samples inflates it; greedy is the reproducible single-shot number.
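The first two guards can be checked mechanically. The sketch below is an illustration under assumptions: the real extract_answer lives in rewards.py and isn't shown here, so this version is a hypothetical stand-in that reads the <answer> tags seen in the completions above, and leaked_questions is a made-up helper name. The point is that reward and eval import one shared extractor, and that the splits are checked for overlap before you trust the score.

```python
import re
from typing import Iterable, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Hypothetical extractor: stripped text of the last <answer>...</answer> pair."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def _normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(question.lower().split())


def leaked_questions(train: Iterable[str], test: Iterable[str]) -> set[str]:
    """Return normalized questions that appear in both splits."""
    return {_normalize(q) for q in train} & {_normalize(q) for q in test}


# Toy data standing in for the real splits.
train_qs = ["What is 2 + 2?", "What is  3 * 3?"]
test_qs = ["what is 3 * 3?", "What is 10 / 2?"]
print(leaked_questions(train_qs, test_qs))
print(extract_answer("<reasoning>48 + 24 = 72.</reasoning>\n<answer> 72 </answer>"))
```

An empty set from leaked_questions is what you want to see. Exact-match overlap is only a first pass; it won't catch paraphrased duplicates.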
Reference: GSM8K · DeepSeek-R1 (pass@1 via RLVR) · Eval-Driven Development (drip)