Teach a small model to reason with RLVR: train with GRPO against a verifiable math reward using Unsloth + TRL, with LoRA on a single GPU, and measure pass@1 before and after training so you can watch the accuracy line move. Companion build to the RLVR & Process Rewards drip.
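
To make the three moving parts concrete, here is a minimal, library-free sketch of a verifiable math reward, a group-relative advantage in the spirit of GRPO, and a pass@1 measurement. It is not the Unsloth or TRL API and not necessarily how this build implements them. The function names (`extract_answer`, `math_reward`, `group_advantages`, `pass_at_1`), the `\boxed{...}` answer convention, and the use of a population standard deviation are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev
from typing import List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the last \\boxed{...} content, else the last number in the text.

    Assumption: the model is prompted to put its final answer in \\boxed{}.
    """
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def math_reward(completion: str, target: str) -> float:
    """Verifiable reward: 1.0 if the extracted answer matches the target, else 0.0."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Compare numerically when both sides parse as numbers.
        return 1.0 if abs(float(answer) - float(target)) < 1e-6 else 0.0
    except ValueError:
        # Fall back to an exact string match for non-numeric answers.
        return 1.0 if answer == target.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize each reward against the mean and std of its sampling group.

    All completions in `rewards` are assumed to come from the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_1(completions: List[str], targets: List[str]) -> float:
    """Fraction of problems solved with one sampled completion per problem."""
    scores = [math_reward(c, t) for c, t in zip(completions, targets)]
    return mean(scores)


if __name__ == "__main__":
    # One prompt, four sampled completions, ground-truth answer "42".
    group = ["... so \\boxed{42}", "I get 41", "\\boxed{42}", "not sure"]
    rewards = [math_reward(c, "42") for c in group]
    print("rewards:", rewards)
    print("advantages:", group_advantages(rewards))
    print("pass@1:", pass_at_1(["\\boxed{4}", "5"], ["4", "6"]))
```

Running the same `pass_at_1` over a held-out problem set before and after training gives the two numbers to compare. Note that when every completion in a group gets the same reward, all of its advantages are zero, so that prompt contributes no learning signal.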