GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines by averaging rewards within a group of sampled outputs rather than using a separate value model
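A minimal sketch of the group-relative baseline idea, assuming the common normalization of subtracting the group's mean reward and dividing by its standard deviation (function name and details are illustrative, not GRPO's exact implementation):

```python
def group_relative_advantages(rewards):
    """rewards: scalar rewards for one group of outputs sampled from the same prompt.

    Returns per-output advantages using the group mean (and std) as the
    baseline, instead of a separately trained value model.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard: if all rewards are equal, avoid division by zero
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```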
Importance Sampling Ratio: The ratio of a token's probability under the current policy to its probability under the old policy; used to correct for off-policy data
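In practice the ratio is computed from log-probabilities to avoid underflow; a small illustrative sketch:

```python
import math

def importance_ratio(logp_new, logp_old):
    """Per-token ratio pi_new(token) / pi_old(token), from log-probabilities."""
    return math.exp(logp_new - logp_old)

# A token twice as likely under the current policy yields a ratio near 2:
print(importance_ratio(math.log(0.4), math.log(0.2)))  # ≈ 2.0
```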
Geometric Mean: A type of mean calculated by multiplying N positive numbers and taking the Nth root; less sensitive to large outliers than the arithmetic mean
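A direct translation of the definition (multiply the N numbers, take the Nth root):

```python
import math

def geometric_mean(xs):
    # Product of the N numbers, then the Nth root.
    return math.prod(xs) ** (1 / len(xs))

print(geometric_mean([1, 10, 100]))  # ≈ 10.0, versus an arithmetic mean of 37.0
```

Note how the outlier 100 pulls the arithmetic mean far more than the geometric mean.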
Pass@1: The percentage of problems for which the model generates a correct answer on its first attempt
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
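For discrete distributions the definition unrolls to a simple sum; a sketch (note the asymmetry: D_KL(p‖q) generally differs from D_KL(q‖p)):

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive; zero only when p == q
```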
PPO: Proximal Policy Optimization—an RL algorithm that clips the importance sampling ratio so that each policy update stays within a small range, ensuring stable training
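The clipping mechanism can be sketched per token, assuming the standard clipped surrogate objective with the conventional epsilon of 0.2:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min of unclipped and clipped terms."""
    unclipped = ratio * advantage
    # Clamp the importance sampling ratio to [1 - eps, 1 + eps].
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Taking the min removes the incentive to move the ratio far from 1.
    return min(unclipped, clipped)

print(ppo_clipped_objective(1.5, 1.0))   # clipped to 1.2 * 1.0 = 1.2
print(ppo_clipped_objective(0.5, -1.0))  # clipped to 0.8 * -1.0 = -0.8
```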
Token Entropy: A measure of the randomness or uncertainty in the model's token predictions; higher entropy generally indicates more exploration
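Entropy over a token distribution is the standard Shannon entropy; a sketch showing that a uniform distribution (maximal exploration) scores higher than a peaked one:

```python
import math

def token_entropy(probs):
    # Shannon entropy in nats: H = -sum_x p(x) * log(p(x)).
    return -sum(p * math.log(p) for p in probs if p > 0)

print(token_entropy([0.25] * 4))              # uniform: ln(4) ≈ 1.386
print(token_entropy([0.97, 0.01, 0.01, 0.01]))  # peaked: much lower
```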
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer