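GRPO: Group Relative Policy Optimization—an RL algorithm that samples a group of outputs for the same input and normalizes each output's reward against the group's mean and standard deviation, which reduces variance and removes the need for a separate learned value baseline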
CoT: Chain of Thought—intermediate reasoning steps generated by a model before the final answer
EMA: Exponential Moving Average—a running average that weights recent values more heavily; applied to model weights, it yields a slowly changing copy that can serve as a stable reference model
KL divergence: Kullback-Leibler divergence—an asymmetric measure of how much one probability distribution differs from another; often used as a penalty that keeps the trained policy close to a reference model and so prevents drift
SFT: Supervised Fine-Tuning—training a model on labeled examples, typically as a stage before reinforcement learning is applied
OOD: Out-of-Distribution—data that differs significantly from the training set (e.g., unseen environments or tasks)
Process Supervision: Training signals provided at each step of reasoning, rather than just for the final outcome
Sparse Bonus: A reward given only to a subset of high-performing samples (e.g., those above a threshold), rather than to all samples
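
Several of the terms above (GRPO's group normalization, the EMA update, KL divergence, and the sparse bonus) reduce to short formulas. The following is a minimal Python sketch of each, not a reference implementation. All function names, the `eps` stabilizer, the default `decay` and `bonus` values, and the choice of population standard deviation are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs for the same input.

    Assumption: population standard deviation; eps avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ema_update(ref_weights, new_weights, decay=0.99):
    """Move reference weights a small step toward the current weights."""
    return [decay * r + (1.0 - decay) * w
            for r, w in zip(ref_weights, new_weights)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumption: q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sparse_bonus(rewards, threshold, bonus=1.0):
    """Add a bonus only to samples whose reward exceeds the threshold."""
    return [r + bonus if r > threshold else r for r in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    print(ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9))
    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
    print(sparse_bonus([0.2, 0.9, 0.5], threshold=0.5))
```

Note that `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, which is why the direction of the penalty matters when it is used to limit drift from a reference model.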