Step 3: The Dataset — Train a Reasoner with GRPO

Why GSM8K

GSM8K is a set of roughly 8.5K grade-school math word problems, each with a worked solution ending in a clean final number after ####. That final number is the ground truth our verifier needs: no human preference labels, no reward model, just "did the model land on this value?"
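
To make the #### convention concrete, here is a small sketch that splits one solution string into its working and its final number. The string is typed in by hand in the shape of a GSM8K record (the << >> spans are the dataset's calculator annotations), and the names raw_solution, working, and final are illustrative only.

```python
# A hand-typed solution string in the shape of a GSM8K record.
# The <<...>> spans are calculator annotations embedded in the working.
raw_solution = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

# Everything before the last '####' is the working; everything after is the answer.
working, _, final = raw_solution.rpartition("####")
final = final.strip()

print(final)  # 72
```

Only the final value is used as ground truth here; the working is left untouched.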

Load and reshape

We do two things: pull out the gold answer (the bit after ####), and format the prompt as a chat with a system message that asks for a parseable structure. Structure matters — the reward can only check an answer it can find.

# data.py
import re
from datasets import load_dataset

# System message: asks for a fixed layout so the reward can locate the answer.
SYSTEM = (
    "You are a careful math tutor. Reason step by step, then give the final "
    "answer.\n"
    "Respond in exactly this format:\n"
    "<reasoning>\n...your working...\n</reasoning>\n"
    "<answer>\nFINAL_NUMBER\n</answer>"
)

def gold_answer(solution: str) -> str:
    # GSM8K gold answers come after '####'; drop thousands separators ("1,000" -> "1000")
    return solution.split("####")[-1].strip().replace(",", "")

def build_dataset(split: str = "train"):
    ds = load_dataset("openai/gsm8k", "main", split=split)
    # Add a chat-formatted "prompt" and overwrite "answer" with just the gold number.
    return ds.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": x["question"]},
            ],
            "answer": gold_answer(x["answer"]),
        }
    )

GRPOTrainer expects a prompt column (chat-formatted is fine). Any other columns (here, answer) are passed straight through to your reward functions as keyword arguments. That pass-through is how the verifier gets the gold value in Step 4.
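
As a rough sketch of what that pass-through looks like from the reward function's side, the snippet below defines a function that receives completions plus the answer column by name and is then called by hand with made-up data. It assumes TRL's documented convention that dataset columns arrive as keyword arguments and that chat completions are lists of message dicts. The name exact_answer_reward and the regex are illustrative, not the verifier built in Step 4.

```python
import re

def exact_answer_reward(completions, answer, **kwargs):
    """Hypothetical reward: 1.0 if the <answer> block equals the gold value, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, answer):
        text = completion[0]["content"]  # chat format: a list of message dicts
        match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
        predicted = match.group(1).replace(",", "") if match else None
        rewards.append(1.0 if predicted == gold else 0.0)
    return rewards

# Simulated call: in training the trainer would supply these arguments.
completions = [
    [{"role": "assistant",
      "content": "<reasoning>\n48 + 24 = 72\n</reasoning>\n<answer>\n72\n</answer>"}],
    [{"role": "assistant", "content": "I think it is 70."}],
]
scores = exact_answer_reward(completions=completions, answer=["72", "72"], prompts=None)
print(scores)  # [1.0, 0.0]
```

The **kwargs catch-all lets the function ignore arguments it does not use, such as prompts.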

Look at one

if __name__ == "__main__":
    ds = build_dataset("train")
    ex = ds[0]
    print(ex["prompt"][1]["content"][:80], "...")
    print("gold:", ex["answer"])

$ python data.py
Natalia sold clips to 48 of her friends in April, and then she sold half as many ...
gold: 72

The model never sees gold — it only sees the question and the format instruction. gold lives on the side, waiting for the verifier. That separation is the whole point: the model is graded against ground truth it can't peek at.
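
One way to sanity-check that separation is to confirm that the prompt holds only the system and user messages and that no worked solution has slipped into it. The sketch below runs on a hand-built record shaped like the output of build_dataset; the helper name prompt_leaks_solution is hypothetical, and checking for the #### marker is just one simple heuristic.

```python
def prompt_leaks_solution(example: dict) -> bool:
    """True if any prompt message contains the '####' marker of a worked solution."""
    return any("####" in msg["content"] for msg in example["prompt"])

# Hand-built record in the shape build_dataset produces (contents abbreviated).
example = {
    "prompt": [
        {"role": "system", "content": "Respond in <reasoning> and <answer> tags."},
        {"role": "user", "content": "Natalia sold clips to 48 of her friends ..."},
    ],
    "answer": "72",
}

roles = [msg["role"] for msg in example["prompt"]]
leaked = prompt_leaks_solution(example)
print(roles, leaked)  # ['system', 'user'] False
```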

Reference: GSM8K dataset · TRL GRPO dataset format · RLVR & Process Rewards (drip)