Why GSM8K
GSM8K is a dataset of roughly 8.5K grade-school math word problems, each with a worked solution that ends in a clean final number after ####. That final number is the ground truth our verifier needs: no human preference labels, no reward model, just "did the model land on this value."
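To make the format concrete, here is a minimal sketch using a made-up record in GSM8K's style. The question and solution text are invented for illustration, not copied from the dataset. It shows that everything before #### is worked reasoning and only the trailing value is what gets checked.

```python
# Illustrative record in GSM8K's style; the text is invented for this sketch.
record = {
    "question": "A box holds 12 pencils. How many pencils are in 150 boxes?",
    "answer": "Each box holds 12 pencils.\n150 * 12 = 1800 pencils.\n#### 1,800",
}

# Split the worked solution from the final value at the '####' marker.
working, _, final = record["answer"].partition("####")

# Normalize the final value: trim whitespace, drop thousands separators.
final = final.strip().replace(",", "")

print(final)  # 1800
```

Stripping the comma here mirrors what gold_answer does in data.py, so a gold value of "1,800" and a model answer of "1800" compare equal as strings.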
Load and reshape
We do two things: pull out the gold answer (the bit after ####), and format the prompt as a chat with a system message that asks for a parseable structure. Structure matters — the reward can only check an answer it can find.
# data.py
import re
from datasets import load_dataset

SYSTEM = (
    "You are a careful math tutor. Reason step by step, then give the final "
    "answer.\n"
    "Respond in exactly this format:\n"
    "<reasoning>\n...your working...\n</reasoning>\n"
    "<answer>\nFINAL_NUMBER\n</answer>"
)

def gold_answer(solution: str) -> str:
    # GSM8K gold answers come after '####'
    return solution.split("####")[-1].strip().replace(",", "")

def build_dataset(split: str = "train"):
    ds = load_dataset("openai/gsm8k", "main", split=split)
    return ds.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": x["question"]},
            ],
            "answer": gold_answer(x["answer"]),
        }
    )

GRPOTrainer expects a prompt column (chat-formatted is fine); any other columns, here answer, are passed straight through to your reward functions. That pass-through is how the verifier gets the gold value in Step 4.
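As a rough sketch of what that pass-through looks like from the reward function's side: this assumes TRL's convention that a reward function receives completions plus the extra dataset columns as keyword arguments, and that chat-formatted completions arrive as lists of message dicts. The names extract_answer and correctness_reward and the 1.0/0.0 scoring are placeholders for illustration, not the Step 4 verifier.

```python
import re
from typing import Optional

# Hypothetical helper names and scoring; only the calling convention
# (completions + dataset columns as kwargs) is assumed from TRL.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def extract_answer(text: str) -> Optional[str]:
    """Return the text inside the last <answer> block, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].replace(",", "") if matches else None

def correctness_reward(completions, answer, **kwargs):
    """Score 1.0 when the parsed answer equals the gold string, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, answer):
        # Chat-formatted completions are a list of message dicts.
        text = completion[0]["content"]
        rewards.append(1.0 if extract_answer(text) == gold else 0.0)
    return rewards

# Simulate the trainer's call with two fake completions for one question.
fake = [
    [{"role": "assistant",
      "content": "<reasoning>\n48 + 24\n</reasoning>\n<answer>\n72\n</answer>"}],
    [{"role": "assistant", "content": "The answer is 72."}],  # no tags
]
print(correctness_reward(fake, answer=["72", "72"]))  # [1.0, 0.0]
```

The second completion contains the right number but scores 0.0, because without the tags the function has nothing to parse.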
Look at one
if __name__ == "__main__":
    ds = build_dataset("train")
    ex = ds[0]
    # Show the first 80 characters of the question, then the gold answer
    print(ex["prompt"][1]["content"][:80], "...")
    print("gold:", ex["answer"])

$ python data.py
Natalia sold clips to 48 of her friends in April, and then she sold half as many ...
gold: 72

The model never sees gold; it only sees the question and the format instruction. The gold value lives on the side, waiting for the verifier. That separation is the whole point: the model is graded against ground truth it can't peek at.
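One way to spot-check that separation, sketched with a hand-built record shaped like a row from build_dataset. The helper name leaks_solution and the record text are hypothetical; the check confirms the worked solution never appears in any message the model is given.

```python
def leaks_solution(example: dict, solution: str) -> bool:
    """True if the worked solution text appears in any prompt message."""
    return any(solution in msg["content"] for msg in example["prompt"])

# Hand-built record mimicking one row after build_dataset; text is illustrative.
solution = "48 / 2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"
example = {
    "prompt": [
        {"role": "system", "content": "Reason step by step, then answer."},
        {"role": "user",
         "content": "Natalia sold 48 clips in April and half as many in May. "
                    "How many in total?"},
    ],
    "answer": "72",
}

print(leaks_solution(example, solution))  # False
```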
Reference: GSM8K dataset · TRL GRPO dataset format · RLVR & Process Rewards (drip)