The one knob that defines GRPO: the group

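GRPO's baseline is the group mean: for each prompt it samples num_generations answers, scores them all, and sets each answer's advantage to its score minus the group's average. (The original GRPO formulation also divides by the group's standard deviation.) That makes the group size the central hyperparameter. Too small (2–3) and the mean is estimated from too few samples, so the baseline is noisy; too large and each step is slow, because every prompt costs num_generations full generations. Eight is the common starting point.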
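To make the arithmetic concrete, here is a minimal sketch of the group-relative advantage for a single prompt. The helper name group_advantages, its normalize flag, and the eps value are illustrative assumptions, not TRL's API; libraries also differ on which standard-deviation estimator and epsilon they use.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize=False, eps=1e-4):
    """Advantage of each sampled answer relative to its own group.

    Hypothetical helper for illustration; not part of TRL.
    """
    baseline = mean(rewards)                      # the group mean is the baseline
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    spread = pstdev(rewards)                      # population std of the group (an assumption)
    return [c / (spread + eps) for c in centered]


# One prompt, a group of 8 answers, three of them scored 2.0 and the rest 0.0.
rewards = [2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0]
advantages = group_advantages(rewards)
print(advantages)  # 1.25 for the three high scorers, -0.75 for the rest

# A group where every answer scores the same carries no learning signal.
print(group_advantages([2.0] * 8))  # all zeros
```

The second call shows why groups need some variety in outcomes: when every answer in a group earns the same reward, every advantage is zero and that prompt contributes nothing to the update.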

# config.py
from trl import GRPOConfig
 
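def make_config(max_seq: int = 2048, max_prompt: int = 256) -> GRPOConfig:
    """Build the GRPO config; prompt + completion together fit in max_seq tokens."""
    return GRPOConfig(
        # --- the group ---
        num_generations=8,            # answers sampled per prompt
        # --- lengths ---
        max_prompt_length=max_prompt,
        max_completion_length=max_seq - max_prompt,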
        # --- optimization ---
        learning_rate=5e-6,           # RL wants a *small* LR
        adam_beta1=0.9,
        adam_beta2=0.99,
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        optim="adamw_8bit",
        # --- batching (per device) ---
        per_device_train_batch_size=8,   # a multiple of num_generations
        gradient_accumulation_steps=1,
        # --- run length ---
        max_steps=300,                # a few hundred steps shows the lift
        logging_steps=1,
        save_steps=100,
        # --- generation temperature for exploration ---
        temperature=1.0,
        use_vllm=True,                # generate with vLLM (Unsloth's fast generation path)
        output_dir="grpo-reasoner",
    )

Two things are worth calling out. The learning rate is tiny (5e-6): RL fine-tuning should nudge the policy, and a large LR can collapse it. And per_device_train_batch_size should be a multiple of num_generations so groups stay whole within a batch; with both set to 8, each device processes exactly one prompt's group per step.
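The batch-size rule is easy to check before launching a run. The sketch below uses a hypothetical helper, prompts_per_batch, that covers only the single-device case with no gradient accumulation; how the effective batch is counted across devices and accumulation steps depends on the TRL version, so treat this as a sanity check and not as the library's own validation.

```python
def prompts_per_batch(batch_size: int, num_generations: int) -> int:
    """Number of whole groups (distinct prompts) that fit in one batch.

    Hypothetical helper for illustration; not part of TRL.
    """
    if batch_size % num_generations != 0:
        raise ValueError(
            f"batch size {batch_size} is not a multiple of "
            f"num_generations={num_generations}; a group would be split"
        )
    return batch_size // num_generations


print(prompts_per_batch(8, 8))    # 1: the config above trains on one prompt per step
print(prompts_per_batch(16, 8))   # 2
```

Raising the batch size to 16 while keeping the group at 8 doubles the number of distinct prompts per step without changing how each baseline is computed.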

The trainer

GRPOTrainer takes the model, the tokenizer (as processing_class), your reward functions (as a list; their outputs are summed), the config, and the dataset.

# train.py (part 1)
from trl import GRPOTrainer
from model import model, tokenizer
from data import build_dataset
from rewards import correctness_reward, format_reward
from config import make_config
 
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward, correctness_reward],  # summed per completion
    args=make_config(),
    train_dataset=build_dataset("train"),
)

Passing multiple reward functions is how you compose signals: here the total reward per answer is format (0 or 0.5) + correctness (0 or 2.0), so it ranges from 0 to 2.5. You can log and weight them separately (GRPOConfig's reward_weights sets the weights, which default to 1.0 each), but the trainer just needs the list: it sums them into the scalar from which each answer's advantage is computed.
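As a rough illustration of that summation, the sketch below defines stand-in versions of the two reward functions and combines them by hand. Everything here is an assumption made for the example: the `<answer>` tag format, the plain-string completions, the `answer` column name, and the total_rewards helper. The real functions live in rewards.py, and the trainer does the combining internally.

```python
import re

# Hypothetical output format: the final answer sits inside <answer> tags.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completions, **kwargs):
    """0.5 when the completion contains an <answer>...</answer> block, else 0."""
    return [0.5 if ANSWER_TAG.search(c) else 0.0 for c in completions]


def correctness_reward(completions, answer, **kwargs):
    """2.0 when the tagged answer matches the reference, else 0."""
    scores = []
    for completion, reference in zip(completions, answer):
        match = ANSWER_TAG.search(completion)
        predicted = match.group(1).strip() if match else None
        scores.append(2.0 if predicted == reference else 0.0)
    return scores


def total_rewards(reward_funcs, weights=None, **batch):
    """Weighted sum of every reward function, one scalar per completion."""
    weights = weights or [1.0] * len(reward_funcs)
    per_func = [func(**batch) for func in reward_funcs]
    return [sum(w * r for w, r in zip(weights, row)) for row in zip(*per_func)]


completions = [
    "Reasoning... <answer>42</answer>",   # right format, right answer
    "Reasoning... <answer>41</answer>",   # right format, wrong answer
    "The answer is 42",                   # no tags at all
]
totals = total_rewards(
    [format_reward, correctness_reward],
    completions=completions,
    answer=["42", "42", "42"],
)
print(totals)  # [2.5, 0.5, 0.0]
```

Note that the third completion states the right number but earns nothing, because this stand-in correctness check only reads the tagged answer. Whether that coupling between format and correctness is desirable is a design choice for your own reward functions.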


Reference: TRL GRPOConfig · Unsloth GRPO notebook · RLVR — §03 GRPO (drip)