Step 8: What's Next — Train a Reasoner with GRPO

What you built

A small model fine-tuned with RLVR — GRPO against a verifiable reward, LoRA on a single GPU.
A reward that's a checker, not a model — extract the answer, compare to gold — so the reward-hacking gap from the drip's lab never opens.
A measured lift: pass@1 on a held-out GSM8K test split, base vs. tuned, scored by the same verifier that produced the training reward.
The behavioral payoff: a model that reasons because reasoning earns reward — nobody wrote "show your work" into a loss.
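
As a reminder of how small the checker can be, here is a minimal sketch of an extract-and-compare reward plus a pass@1 measurement. The function names and the "take the last number in the completion" extraction rule are assumptions made for illustration; the checker you built in earlier steps may parse a specific answer format instead.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_answer(text):
    """Return the last number in the completion as a plain string, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def verifier_reward(completion, gold):
    """1.0 if the extracted answer equals the gold answer numerically, else 0.0."""
    pred = extract_answer(completion)
    if pred is None:
        return 0.0
    try:
        return 1.0 if float(pred) == float(gold.replace(",", "")) else 0.0
    except ValueError:
        # Gold answer was not a parseable number; treat as no reward.
        return 0.0

def pass_at_1(completions, golds):
    """Mean reward over one sampled completion per held-out problem."""
    rewards = [verifier_reward(c, g) for c, g in zip(completions, golds)]
    return sum(rewards) / len(rewards)
```

Running `pass_at_1` once on base-model completions and once on tuned-model completions, over the same test problems, gives the before/after comparison described above.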

Push it further

Export & serve. Merge the adapters and run the result as a normal model — model.save_pretrained_merged("reasoner", tokenizer), or export to GGUF and serve it via the Ollama blueprint.
Harder verifiers, harder data. Swap GSM8K for MATH, or code with a real test harness. The recipe is identical; only the checker changes. (For code, a weak test suite is a weak verifier — property tests and held-out cases keep it honest, exactly the failure the drip's quiz warns about.)
Add a process reward. Outcome-only reward is sparse. Bootstrap a process reward by rolling out many continuations from each step and scoring that step by the fraction of continuations that reach a correct answer: dense credit assignment with no human step-labeling.
Scale the group. A larger num_generations gives the group baseline lower variance (better gradients) at the cost of generation time, so it is the first setting to turn up when you have more GPU.
Hybrid rewards. For tasks that are only partly checkable, gate a small learned reward model behind the verifier — reward only what passes the check, then let the RM rank among the survivors.
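
Two of the ideas above, the rollout-based process reward and the verifier-gated hybrid reward, can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `rollout_fn`, `is_correct`, `verifier`, and `rm_score` are hypothetical callables you would supply, and `rm_weight` and the assumed [0, 1] range of the reward model score are arbitrary choices.

```python
def mc_step_value(prefix_steps, rollout_fn, is_correct, n_rollouts=8):
    """Estimate a step's value as the fraction of rollouts that end correctly.

    rollout_fn(prefix_steps) -> str is assumed to sample one full completion
    that continues from the given reasoning prefix.
    """
    hits = sum(1 for _ in range(n_rollouts) if is_correct(rollout_fn(prefix_steps)))
    return hits / n_rollouts

def score_steps(steps, rollout_fn, is_correct, n_rollouts=8):
    """Score every step by rolling out from the prefix that ends at that step."""
    return [
        mc_step_value(steps[: i + 1], rollout_fn, is_correct, n_rollouts)
        for i in range(len(steps))
    ]

def gated_reward(completion, gold, verifier, rm_score, rm_weight=0.2):
    """Hybrid reward: the verifier is a hard gate, the reward model only ranks passes.

    Assumes rm_score(completion) returns a value in [0, 1], so any passing
    completion outscores every failing one.
    """
    if not verifier(completion, gold):
        return 0.0
    return 1.0 + rm_weight * rm_score(completion)
```

Because the learned score is only added after the check passes, the reward model can never pay out for an answer the verifier rejects, which keeps the gameable component on a short leash.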

The one thing to remember

RLVR's power and its fence are the same word: verifiable. Where you can cheaply and honestly check correctness, you can delete the most gameable part of the RLHF stack and train reasoning with a few lines of Python. Where you can't, you're back to proxies — so the highest-leverage work is often building a better checker, not a better reward model.

Check your checker, and the rest of the loop mostly takes care of itself.

Reference: RLVR & Process Rewards (drip) · Let's Verify Step by Step (PRMs) · Tülu 3 (open RLVR recipe) · Run an Open Model with Ollama (blueprint)