Reward functions, not a reward model
In TRL, a GRPO reward function takes the batch of completions (and any pass-through dataset columns) and returns a list of floats — one score per completion. There is no learned reward model here. The "reward" is a few lines of Python that check the answer. That's the RLVR move made concrete.
Correctness — the verifier
Extract what's inside <answer>...</answer>, normalize it, and compare to the gold value. Match → high reward; miss → zero.
# rewards.py
import re
def extract_answer(text: str) -> str:
m = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
if not m:
return ""
# keep it comparable: strip $, commas, whitespace
return m.group(1).strip().replace(",", "").replace("$", "")
def correctness_reward(completions, answer, **kwargs):
responses = [c[0]["content"] for c in completions]
got = [extract_answer(r) for r in responses]
return [2.0 if g == a else 0.0 for g, a in zip(got, answer)]answer arrives automatically — it's the dataset column from Step 3. This function is the verifier: deterministic, ungameable in the way a learned RM isn't, because it compares to ground truth.
Format — make the answer findable
A pure correctness reward is sparse early on, when the model rarely gets the format right and the verifier can't even locate an answer to grade. A small, cheap format reward bootstraps the structure so the correctness signal has something to bite on.
FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
def format_reward(completions, **kwargs):
responses = [c[0]["content"] for c in completions]
return [0.5 if FORMAT_RE.search(r) else 0.0 for r in responses]Keep the format reward small relative to correctness (0.5 vs 2.0). It's scaffolding — you want the model chasing correct, with format as a nudge, not the main prize. Over-weighting format is its own tiny reward hack: a model that produces beautiful empty structure.
Test the reward in isolation
Always unit-test a verifier before you train against it — a buggy reward trains a confident wrong model.
if __name__ == "__main__":
good = [[{"content": "<reasoning>2+2</reasoning>\n<answer>72</answer>"}]]
bad = [[{"content": "<reasoning>hmm</reasoning>\n<answer>13</answer>"}]]
print("correct:", correctness_reward(good, answer=["72"])) # [2.0]
print("wrong: ", correctness_reward(bad, answer=["72"])) # [0.0]
print("format: ", format_reward(good)) # [0.5]$ python rewards.py
correct: [2.0]
wrong: [0.0]
format: [0.5]That's the entire "reward model" — and because it's ground truth, the reward-hacking gap from the drip's lab never opens up (as long as the checker is honest, which for exact-match arithmetic it is).
Reference: TRL — using reward functions · RLVR & Process Rewards (drip) · Verifying AI Code (drip)