Step 2: Setup — Train a Reasoner with GRPO

Install

pip install unsloth vllm
pip install "trl>=0.14" datasets

Unsloth pins compatible versions of torch, transformers, trl, and peft, so install it first and let it resolve the stack. On Colab/Kaggle, restart the runtime after install.
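After installing, it can help to confirm which versions were actually resolved. The snippet below is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the packages named in this step.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in this step; adjust the list to match your own stack.
PACKAGES = ["unsloth", "vllm", "trl", "datasets", "torch", "transformers", "peft"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:<14}{ver or 'NOT INSTALLED'}")
```

Run it once after the runtime restart; any `NOT INSTALLED` line means the install did not complete in the active environment.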

Load the base model

We load a small instruct model in 4-bit and, importantly for GRPO, turn on fast inference. GRPO spends most of its wall-clock time generating samples, so vLLM-backed generation is what makes this tractable on one GPU.

# model.py
from unsloth import FastLanguageModel  # import unsloth before transformers/trl so its patches apply

MAX_SEQ = 2048  # max tokens per sequence (prompt + completion)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct",   # or unsloth/Llama-3.2-3B-Instruct
    max_seq_length=MAX_SEQ,
    load_in_4bit=True,
    fast_inference=True,          # vLLM engine for fast group sampling
    gpu_memory_utilization=0.6,   # fraction of VRAM given to vLLM (weights + KV cache); the rest is for training
)
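To make the `gpu_memory_utilization` setting concrete, here is a back-of-the-envelope sketch of how the card gets split. It assumes the value is the fraction of total VRAM handed to the vLLM engine, with everything else left for training. The helper `split_vram` is hypothetical, and real usage will differ somewhat because of CUDA context overhead and fragmentation.

```python
def split_vram(total_gb, gpu_memory_utilization):
    """Split a card's VRAM into (vLLM share, remainder for training), in GB."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    vllm_gb = total_gb * gpu_memory_utilization
    return vllm_gb, total_gb - vllm_gb


if __name__ == "__main__":
    for total in (16, 24):
        vllm_gb, train_gb = split_vram(total, 0.6)
        print(f"{total} GB card: ~{vllm_gb:.1f} GB for vLLM, ~{train_gb:.1f} GB left for training")
```

If training runs out of memory, lowering the fraction is one lever to try; if sampling is the bottleneck and training has headroom, raising it gives vLLM more KV cache.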

Attach LoRA

We don't train the full model. We train LoRA adapters: tens of millions of parameters at this rank, instead of billions. That's what keeps a 4B model trainable on a 16GB card. (If LoRA is new to you, the LoRA & qLoRA drip covers the background.)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",   # extra VRAM savings
    random_state=3407,
)
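To see how small the adapters are, the count can be worked out by hand: each adapted linear layer of shape (in, out) gains r * (in + out) weights. The sketch below does that arithmetic for the seven target modules. The `Shape` class and `lora_params` helper are hypothetical, and the layer dimensions are assumptions taken from the public model configs, so check them against the `config.json` of the checkpoint you load.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    hidden: int        # model width
    layers: int        # number of transformer blocks
    intermediate: int  # MLP width
    q_out: int         # num_attention_heads * head_dim
    kv_out: int        # num_key_value_heads * head_dim


def lora_params(s, r):
    """Total LoRA weights: each Linear(in, out) adds r * (in + out)."""
    per_layer = [
        (s.hidden, s.q_out),         # q_proj
        (s.hidden, s.kv_out),        # k_proj
        (s.hidden, s.kv_out),        # v_proj
        (s.q_out, s.hidden),         # o_proj
        (s.hidden, s.intermediate),  # gate_proj
        (s.hidden, s.intermediate),  # up_proj
        (s.intermediate, s.hidden),  # down_proj
    ]
    return s.layers * sum(r * (n_in + n_out) for n_in, n_out in per_layer)


# Assumed shapes (verify against each model's config.json).
LLAMA_3_2_3B = Shape(hidden=3072, layers=28, intermediate=8192, q_out=3072, kv_out=1024)
QWEN3_4B = Shape(hidden=2560, layers=36, intermediate=9728, q_out=4096, kv_out=1024)

if __name__ == "__main__":
    for name, shape in [("Llama-3.2-3B", LLAMA_3_2_3B), ("Qwen3-4B", QWEN3_4B)]:
        print(f"{name}: {lora_params(shape, r=16):,} trainable LoRA parameters")
```

Under these assumed shapes, r=16 comes to roughly 24M parameters for the 3B model and 33M for the 4B model, around 1% of the base weights in both cases.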

Sanity check

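if __name__ == "__main__":
    from transformers import TextStreamer
    msgs = [{"role": "user", "content": "What is 17 * 24? Think, then answer."}]
    # Render the chat template and tokenize; return_dict=True gives input_ids plus attention_mask.
    inputs = tokenizer.apply_chat_template(
        msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to("cuda")
    # Stream tokens to stdout as they are generated.
    model.generate(**inputs, max_new_tokens=256, streamer=TextStreamer(tokenizer))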

You should get a coherent (if not always correct) attempt. That "sometimes right" baseline is exactly what GRPO will sharpen: the base model already has the ability, and we're going to reward the samples that land it.
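If you want a quick, automatic read on whether a given attempt landed the answer, a crude check is enough at this stage. This is a minimal sketch, assuming the final number in the output is the model's answer, which is a heuristic and will not always hold. The helpers `last_integer` and `is_correct` are hypothetical names.

```python
import re


def last_integer(text):
    """Return the last integer appearing in `text`, or None if there is none."""
    matches = re.findall(r"\d[\d,]*", text)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma) before converting.
    return int(matches[-1].replace(",", ""))


def is_correct(text, expected=17 * 24):
    """True if the last integer in the model output equals the expected answer."""
    return last_integer(text) == expected


if __name__ == "__main__":
    sample = "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."
    print(is_correct(sample))
```

Running the sanity prompt a handful of times and counting how often `is_correct` returns `True` gives a rough baseline accuracy to compare against later.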

Reference: Unsloth install · Unsloth GRPO guide · LoRA & qLoRA (drip)