The Goal

The companion drip, RLVR & Process Rewards, makes the argument: where correctness is checkable, you don't need a learned reward model — a verifier is a reward you can't game. This blueprint runs that loop end to end on a small model.

By the end you'll have:

  • A 4-bit base model + LoRA adapters loaded with Unsloth, trainable on one GPU.
  • A verifiable reward: extract the model's final answer and compare it to the GSM8K gold answer, scoring 1 for a match and 0 otherwise, plus a small format reward.
  • A GRPO training loop (TRL's GRPOTrainer) that samples a group of answers per question, scores them with the verifier, and pushes the policy toward the ones that check out.
  • A before/after eval: pass@1 on held-out grade-school math, so the lift is a number, not a vibe.
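As a rough preview of the verifiable reward in the list above, here is a minimal sketch in plain Python. GSM8K gold solutions end with a line of the form `#### <number>`; everything else here is an assumption made for illustration: the `<answer>…</answer>` tag convention, the function names, and the 0.1 format bonus are hypothetical and not necessarily what the companion repo uses.

```python
import re
from typing import Optional

# Assumed output convention (hypothetical): the model wraps its final answer in <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_gold(gsm8k_answer: str) -> str:
    """GSM8K solutions end with '#### <number>'; keep the text after the marker."""
    return gsm8k_answer.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion: str) -> Optional[str]:
    """Return the text inside the first <answer> tag, or None if the tag is missing."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return match.group(1).strip().replace(",", "").replace("$", "")


def correctness_reward(completion: str, gsm8k_answer: str) -> float:
    """1.0 if the predicted number equals the gold number, else 0.0."""
    pred = extract_prediction(completion)
    if pred is None:
        return 0.0
    try:
        return 1.0 if float(pred) == float(extract_gold(gsm8k_answer)) else 0.0
    except ValueError:
        # Non-numeric text inside the tag counts as a wrong answer.
        return 0.0


def format_reward(completion: str) -> float:
    """Small bonus (0.1 is an arbitrary choice) for using the answer tag at all."""
    return 0.1 if ANSWER_RE.search(completion) else 0.0
```

The correctness check compares numbers rather than strings, so `72` and `72.0` count as the same answer, and a completion without the tag earns nothing from either function.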

Why this is the whole RLVR recipe in miniature

Everything that makes 2026's reasoning models work is here, just small:

No learned reward model (the verifier is the reward). No value network (GRPO uses the group mean as the baseline). That's why this fits on one GPU.
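To make "the group mean as the baseline" concrete, the sketch below computes group-relative advantages for the completions sampled from one prompt. It is an illustration under stated assumptions, not TRL's code: the function name and `eps` value are hypothetical, it uses the population standard deviation, and real implementations differ on how (or whether) they normalize.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Advantage of each completion relative to its own sampling group.

    The group mean stands in for the value-network baseline; dividing by the
    group standard deviation (plus eps) is an assumed normalization choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two verified correct, two not.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])   # correct ones > 0, wrong ones < 0

# If every sample gets the same reward, all advantages are zero.
uniform = group_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case is worth noticing: a group where every sample is right, or every sample is wrong, carries no learning signal, which is one reason the base model needs to solve the task at least some of the time.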

What you'll need

Choice | Why
Unsloth | 2× faster, ~50% less VRAM LoRA training, with fast vLLM-backed generation built in. GRPO samples a lot, so generation speed dominates.
TRL GRPOTrainer | The reference GRPO implementation; you supply reward functions and it handles sampling, advantages, and the update.
A small instruct base (e.g. Qwen3-4B-Instruct or Llama-3.2-3B-Instruct) | Big enough to sometimes solve GSM8K (RLVR sharpens latent ability), small enough to train on a T4/L4.
GSM8K | Grade-school math word problems with clean gold answers: the canonical verifiable-reward dataset.

A reality check up front

RLVR sharpens what the base model can already sometimes do — it raises pass@1 toward pass@k. On a 3–4B model you'll see a real, measurable jump on GSM8K (often low-double-digit points), not GPT-4-level math. That's the honest scope of a single-GPU run, and it's enough to see the mechanism work — which is the point.
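The gap between pass@1 and pass@k can be measured from a handful of samples per problem. The sketch below uses the standard unbiased pass@k estimator from code-generation evaluation, 1 - C(n-c, k) / C(n, k); it is not taken from the companion repo, and the function name and sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n sampled answers to a problem were verified correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 8 samples drawn, 2 pass the verifier.
p1 = pass_at_k(n=8, c=2, k=1)   # 0.25
p4 = pass_at_k(n=8, c=2, k=4)   # 1 - 15/70, about 0.79
```

In this made-up case the model already reaches the answer in most 4-sample groups but only a quarter of the time in one shot; that difference is the headroom RLVR tries to close.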

The companion repo

Runnable version: github.com/maraja/train-a-reasoner-with-grpo. Follow the blueprint or clone and run train.py / eval.py.

What's coming

Eight steps:

  1. What we're building (you're here)
  2. Setup — GPU, Unsloth, TRL, load a 4-bit base + LoRA
  3. The dataset — GSM8K, prompt format, keep the gold answers
  4. The verifiable reward — extract the answer, compare, plus a format reward
  5. GRPO config — group size, generation, the trainer
  6. Train — run it, and what to watch on the reward curve
  7. Evaluate — pass@1 before vs after
  8. What's next — process rewards, stronger verifiers, scaling, export

Reference: RLVR & Process Rewards (drip) · Unsloth GRPO guide · TRL GRPOTrainer · GSM8K