GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input, removing the need for a value function critic
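The group-relative normalization can be sketched in a few lines; this is an illustrative minimal version (function name and the 1e-8 stabilizer are my own choices, not from a specific implementation):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's reward
    by the mean and std of its group (all outputs for the same prompt),
    replacing a learned value-function baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # small epsilon avoids div-by-zero

# Hypothetical rewards for 4 sampled outputs to one prompt:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct outputs receive positive advantages and incorrect ones negative, without any critic network.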
RLVR: Reinforcement Learning with Verifiable Rewards—using ground-truth correctness (e.g., math answers) as the reward signal for RL training
Entropy collapse: A phenomenon where a model becomes overly confident (deterministic) during training, reducing diversity and causing instability
KL divergence: A penalty term often used in RL to keep the trained model close to the original reference model; often removed in reasoning tasks to let the policy drift further from the reference
Chain-of-Thought: Prompting technique where the model generates intermediate reasoning steps before the final answer
vLLM: A high-throughput library for LLM inference and serving
Decoupled clipping: A technique from DAPO that decouples the lower and upper PPO clipping bounds (ε_low, ε_high), raising the upper bound so low-probability tokens can gain mass and exploration is preserved
DAPO: An RL algorithm (Decoupled Clip and Dynamic sAmpling Policy Optimization) that modifies GRPO with decoupled clipping, dynamic sampling, and other stability techniques
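A minimal sketch of a decoupled-clip surrogate term, assuming illustrative epsilon values (the function name and defaults are mine, not DAPO's exact configuration):

```python
import numpy as np

def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with decoupled bounds: the upper
    bound (1 + eps_high) is wider than the lower bound (1 - eps_low),
    allowing low-probability tokens more room to increase."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Standard PPO pessimistic min over unclipped and clipped terms
    return np.minimum(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is capped at 1 + eps_high:
val = clipped_term(2.0, 1.0)
```

With symmetric clipping both bounds would use the same epsilon; widening only the upper bound is what distinguishes the decoupled variant.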
Pass@1: The probability that a single sampled solution is correct, typically estimated as the fraction of correct completions across many samples
Policy entropy: A measure of the randomness in the model's token predictions; higher entropy means more diverse outputs
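Token-level entropy can be computed directly from the model's logits; a minimal sketch (function name is my own, and the softmax is done in NumPy for clarity):

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution.
    Higher entropy = more uniform (diverse) predictions; entropy
    collapse shows up as this value falling toward zero."""
    z = logits - np.max(logits)              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()          # softmax over the vocabulary
    return float(-(p * np.log(p + 1e-12)).sum())

# Uniform logits give the maximum entropy log(V); peaked logits give less.
uniform_h = token_entropy(np.zeros(4))
peaked_h = token_entropy(np.array([10.0, 0.0, 0.0, 0.0]))
```

Averaging this quantity over generated tokens is the usual way policy entropy is tracked during RL training.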