What you built
- A small model fine-tuned with RLVR — GRPO against a verifiable reward, LoRA on a single GPU.
- A reward that's a checker, not a model — extract the answer, compare to gold — so the reward-hacking gap from the drip's lab never opens.
- A measured lift: pass@1 on held-out GSM8K, base vs tuned, with the same verifier used honestly on a test split.
- The behavioral payoff: a model that reasons because reasoning earns reward — nobody wrote "show your work" into a loss.
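As a reminder of how little code the checker and the metric need, here is a minimal sketch of both. The names `extract_answer`, `reward`, and `pass_at_1` are illustrative, not taken from the lab. The sketch assumes numeric answers and GSM8K's convention of ending gold solutions with `#### <answer>`, and it treats `generate` as any function that maps a question to one sampled completion.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

# Integers or decimals, optional sign, optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(text: str) -> Optional[str]:
    """Return the final number in `text` with commas stripped, or None."""
    # GSM8K gold solutions end with "#### <answer>"; use the marker if present.
    if "####" in text:
        text = text.rsplit("####", 1)[1]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Otherwise fall back to the last number that appears.
    return matches[-1].replace(",", "")


def reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answers agree, else 0.0."""
    pred, target = extract_answer(completion), extract_answer(gold)
    if pred is None or target is None:
        return 0.0
    # Compare numerically so "18" and "18.0" count as the same answer.
    return 1.0 if float(pred) == float(target) else 0.0


def pass_at_1(
    generate: Callable[[str], str],
    dataset: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of (question, gold) pairs solved with one sample each."""
    correct = sum(reward(generate(question), gold) for question, gold in dataset)
    return correct / len(dataset)
```

Running `pass_at_1` once with the base model's generator and once with the tuned one, on the same held-out split, gives the base-vs-tuned comparison.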
Push it further
- Export & serve. Merge the adapters and run the result as a normal model: model.save_pretrained_merged("reasoner", tokenizer), or export to GGUF and serve it via the Ollama blueprint.
- Harder verifiers, harder data. Swap GSM8K for MATH, or code with a real test harness. The recipe is identical; only the checker changes. (For code, a weak test suite is a weak verifier: property tests and held-out cases keep it honest, exactly the failure the drip's quiz warns about.)
- Add a process reward. Outcome-only reward is sparse. Bootstrap a process reward by rolling out many continuations from each step and scoring a step by how often it reaches a correct answer — dense credit assignment, no human step-labeling.
- Scale the group. More num_generations gives a lower-variance baseline (better gradients) at the cost of generation time; it is the first thing to turn up when you have more GPU.
- Hybrid rewards. For tasks that are only partly checkable, gate a small learned reward model behind the verifier: reward only what passes the check, then let the RM rank among the survivors.
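Two of these extensions, the bootstrapped process reward and the verifier-gated hybrid reward, can be sketched in a few lines. This is a rough illustration under stated assumptions, not a tuned implementation. `rollout`, `is_correct`, `passes_check`, and `rm_score` are hypothetical callables you would supply, and `rm_weight` and the `1.0 + bonus` shape are one possible choice for keeping every survivor above every failure.

```python
from typing import Callable, List, Sequence


def step_values(
    question: str,
    steps: Sequence[str],
    rollout: Callable[[str, Sequence[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
) -> List[float]:
    """Monte Carlo process reward.

    For each prefix steps[:k+1], sample `n_rollouts` continuations and score
    step k by the fraction that end in a verified-correct final answer.
    """
    values = []
    for k in range(len(steps)):
        prefix = steps[: k + 1]
        hits = sum(is_correct(rollout(question, prefix)) for _ in range(n_rollouts))
        values.append(hits / n_rollouts)
    return values


def hybrid_rewards(
    completions: Sequence[str],
    passes_check: Callable[[str], bool],
    rm_score: Callable[[str], float],
    rm_weight: float = 0.1,
) -> List[float]:
    """Verifier-gated reward. Assumes rm_score returns values in [0, 1]."""
    rewards = []
    for completion in completions:
        if not passes_check(completion):
            # Failed the check: the reward model is never consulted.
            rewards.append(0.0)
        else:
            # Passed: base reward plus a small bonus that ranks the survivors.
            rewards.append(1.0 + rm_weight * rm_score(completion))
    return rewards
```

The cost of `step_values` is `len(steps) * n_rollouts` generations per solution, which is the price of dense credit assignment without human step labels. In `hybrid_rewards`, keeping `rm_weight` small means the learned model can reorder passing completions but can never lift a failing one above them.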
The one thing to remember
RLVR's power and its fence are the same word: verifiable. Where you can cheaply and honestly check correctness, you can delete the most gameable part of the RLHF stack and train reasoning with a few lines of Python. Where you can't, you're back to proxies — so the highest-leverage work is often building a better checker, not a better reward model.
Check your checker, and the rest of the loop mostly takes care of itself.
Reference: RLVR & Process Rewards (drip) · Let's Verify Step by Step (PRMs) · Tülu 3 (open RLVR recipe) · Run an Open Model with Ollama (blueprint)