PPO: Proximal Policy Optimization—an RL algorithm that clips updates to prevent the policy from changing too drastically
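A minimal numpy sketch of PPO's clipped surrogate (function name and the default eps=0.2 are illustrative, not from this glossary): the importance ratio is clipped so a single update cannot move the policy too far.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-token PPO surrogate: clip the importance ratio to
    [1 - eps, 1 + eps] to cap how drastically the policy can change."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum keeps the objective pessimistic.
    return np.minimum(unclipped, clipped)

# A ratio of 2.0 with positive advantage is capped at (1 + eps) * A.
print(ppo_clip_objective(np.array([2.0]), np.array([1.0])))  # → [1.2]
```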
GRPO: Group Relative Policy Optimization—a critic-free variant of PPO that normalizes rewards within a group of outputs for the same prompt
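A sketch of the critic-free normalization GRPO uses (helper name and the 1e-8 stabilizer are illustrative): each response's reward is standardized against the group sampled for the same prompt.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward by the mean and std
    of the group of outputs for one prompt, so no learned critic is needed."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four sampled answers to one prompt, scored 0/1 by a verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get positive advantage, wrong ones negative
```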
PA&LP: Positive-Advantage Low-Probability tokens—tokens that are good actions (positive advantage) but currently unlikely; their gradients encourage exploration
NA&LP: Negative-Advantage Low-Probability tokens—tokens that are bad actions (negative advantage) and currently unlikely; their gradients encourage exploitation (convergence)
stop gradient: An operation that prevents error signals from backpropagating through a specific part of the computation graph, used here to decouple the clipping condition from the gradient value
importance sampling ratio: The ratio of the probability of an action under the new policy vs. the old policy; used to estimate the new policy's value using old data
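The ratio is typically computed from log-probabilities for numerical stability; a small sketch (function name illustrative):

```python
import numpy as np

def importance_ratio(logp_new, logp_old):
    """pi_new(a|s) / pi_old(a|s), computed via exp of the log-prob
    difference rather than dividing raw probabilities."""
    return np.exp(logp_new - logp_old)

# A token that became twice as likely under the new policy has ratio 2.
r = importance_ratio(np.log(0.2), np.log(0.1))  # → 2.0
```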
entropy collapse: A failure mode where the policy becomes deterministic too quickly, stopping exploration
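Entropy collapse can be made concrete by comparing Shannon entropy of two next-token distributions (a minimal numpy sketch; the helper name is illustrative):

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy of a next-token distribution; values near zero
    mean the policy is nearly deterministic (collapsed)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: still exploring
print(policy_entropy([0.999, 0.001, 0.0, 0.0]))  # near zero: collapsed
```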
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization—a baseline method whose decoupled (Clip-Higher) clipping raises the upper clipping bound to encourage exploration
AIME: American Invitational Mathematics Examination—a challenging math competition benchmark
avg@32: Evaluation metric averaging the score over 32 sampled responses per prompt