Step 6: Train — Train a Reasoner with GRPO

Run it

# train.py (part 2)
trainer.train()
 
# save the LoRA adapters
model.save_lora("grpo-reasoner/lora")
print("done — adapters in grpo-reasoner/lora")

$ python train.py
Step  1 | reward 0.51 | reward_std 0.42 | completion_len 88
Step 20 | reward 0.74 | reward_std 0.55 | completion_len 141
Step 60 | reward 1.38 | reward_std 0.71 | completion_len 190
Step 150 | reward 1.92 | reward_std 0.63 | completion_len 205
Step 300 | reward 2.14 | reward_std 0.55 | completion_len 212

On a T4, 300 steps take a few hours; on an L4 or A100, well under an hour. It's slow because each step generates num_generations full answers per prompt, so generation, not the gradient step, dominates the cost.
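To get a feel for why generation dominates, it can help to count the tokens sampled in a single step. This is back-of-the-envelope arithmetic, not a profiler. The values prompts_per_step=1 and num_generations=8 are assumptions for illustration (use whatever your own config sets), and 205 is the completion_len from the step-150 log line above.

```python
def generated_tokens_per_step(prompts_per_step, num_generations, mean_completion_len):
    """Approximate number of tokens sampled during one GRPO step."""
    return prompts_per_step * num_generations * mean_completion_len


# Assumed values for illustration; substitute the ones from your own config.
tokens = generated_tokens_per_step(
    prompts_per_step=1,
    num_generations=8,
    mean_completion_len=205,
)
print(tokens)  # 1640
```

Every one of those tokens is sampled autoregressively before any gradient is computed, so the count scales linearly with num_generations and with completion length.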

What to watch

Three signals, in order of importance:

Mean reward trends up. The headline. Early on it's dominated by the format reward (scores of 0.5) as the model learns the structure; then the correctness reward (scores of 2.0) starts landing and the mean climbs past 1.0 toward 2.0+. A flat reward means the task is too hard for the base model or the reward function can't find the answer, so go re-read your extract_answer.
reward_std stays healthy. GRPO needs variance within a group: if every answer in a group gets the same score, every advantage in that group is zero and the group contributes no gradient. A std that collapses to ~0 means the model has converged (or mode-collapsed); some spread is the training signal.
Completion length. Usually rises then plateaus as the model learns to actually work the problem, rather than either blurting an answer or rambling. A runaway length that never plateaus often means it's padding to game a length-sensitive reward — not our case here, but worth watching.
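The first two signals can be reproduced by hand. Here is a minimal sketch of how a group's rewards turn into advantages, assuming the format and correctness rewards are simply summed and that advantages are normalized as (reward - group mean) / (group std + eps). The helper names, the eps value, and the use of population standard deviation are illustrative choices; library implementations may differ in detail.

```python
from statistics import mean, pstdev

FORMAT_REWARD = 0.5   # from the text: reward for following the structure
CORRECT_REWARD = 2.0  # from the text: reward for a verified answer


def score(has_format, is_correct):
    """Total reward for one completion (assumes the two rewards are summed)."""
    return FORMAT_REWARD * has_format + CORRECT_REWARD * is_correct


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with spread: one correct answer, two format-only, one miss.
mixed = [score(True, True), score(True, False), score(True, False), score(False, False)]
# A group where every completion earns only the format reward.
uniform = [score(True, False)] * 4

print(mixed, group_advantages(mixed))
print(uniform, group_advantages(uniform))
```

In the mixed group the correct completion gets a positive advantage and the rest get negative ones, which is what pushes the policy. In the uniform group every advantage is exactly zero, so that prompt teaches the model nothing on this step.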

The "aha" you're looking for

The satisfying moment in an RLVR run is watching the model discover that showing its work pays off. Early completions guess; later ones lay out steps because steps lead to correct answers, and correct answers are the only thing the verifier rewards. Nobody told it to reason — the verifiable reward made reasoning the winning strategy. That's the whole thesis of the drip, happening in your logs.

If it stalls

Reward flat at ~0.5 → it's only getting the format reward; either the base model can't solve the problems or extract_answer isn't matching. Print a few completions and check what the regex sees.
Out of memory → drop num_generations to 4, lower max_completion_length, or reduce gpu_memory_utilization.
Reward spikes then crashes → learning rate too high; halve it.
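For the flat-reward case, one way to "check what the regex sees" is a small inspection helper like the sketch below. The extract_answer here is a hypothetical stand-in that looks for an <answer>...</answer> tag, and the sample completions are invented for illustration; swap in the extract_answer and the real completions from your own run.

```python
import re

# Hypothetical stand-in: replace with the extract_answer from your own reward code.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(text):
    """Return the text inside the answer tag, or None if the tag is missing."""
    match = ANSWER_RE.search(text)
    return match.group(1) if match else None


def inspect_extraction(completions, extract):
    """Print what the extractor sees for each completion; return the match rate."""
    hits = 0
    for i, text in enumerate(completions):
        answer = extract(text)
        hits += answer is not None
        print(f"[{i}] extracted={answer!r} | tail={text[-60:]!r}")
    return hits / len(completions)


# Invented samples: one follows the expected format, one does not.
samples = [
    "<think>3 * 4 = 12</think><answer>12</answer>",
    "<think>3 * 4 = 12</think> The answer is 12.",
]
rate = inspect_extraction(samples, extract_answer)
print(f"match rate: {rate:.0%}")
```

A low match rate on completions that visibly contain the right number points at the extractor or the prompt format rather than at the model's ability.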

Reference: Unsloth GRPO guide · TRL GRPOTrainer logs · DeepSeekMath / GRPO