GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs from the same prompt to estimate advantages without a value function
AdamW: A variant of the Adam optimizer that decouples weight decay from the gradient update, widely used in LLM training
Surrogate Objective: A proxy loss function used to approximate the true objective (reward maximization) locally, often using importance sampling
Importance Sampling: A technique to estimate properties of a target distribution using samples from a different proposal distribution (the old policy)
Trust Region: A constraint in optimization that prevents the new policy from moving too far from the old policy to ensure stability
Momentum: An optimizer feature that aggregates past gradients to accelerate convergence, which this paper shows can override clipping constraints
Shared Prefix: The sequence of tokens (prompt + early generation) that is identical across multiple samples in a group
KL Divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from another, used as a regularization penalty
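Several of these terms (group normalization, importance sampling, trust-region clipping, KL penalty) describe concrete computations. A minimal plain-Python sketch is below; the function names, the 1e-8 epsilon, and the eps=0.2 clip range are illustrative assumptions, not values taken from the paper.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of its group (all outputs from the same prompt),
    so no learned value function is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate loss: the importance-sampling ratio
    pi_new / pi_old reweights old-policy samples, and clipping enforces
    a trust region around the old policy."""
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                    # importance weight
        clipped = max(1 - eps, min(1 + eps, ratio))  # trust-region clip
        losses.append(-min(ratio * a, clipped * a))  # pessimistic bound
    return sum(losses) / len(losses)

def kl_estimate(logp_new, logp_ref):
    """Monte Carlo estimate of KL(pi_new || pi_ref) from samples drawn
    from pi_new: the average log-probability ratio."""
    return sum(ln - lr for ln, lr in zip(logp_new, logp_ref)) / len(logp_new)
```

For identical policies the ratio is 1 and clipping is inactive; when the ratio leaves [1 - eps, 1 + eps], the clipped term caps how much a single sample can move the objective.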
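The AdamW and Momentum entries also name a specific update rule. A single-parameter sketch is below (hyperparameter defaults and the state-dict layout are assumptions for illustration); it shows both the decoupled weight decay that distinguishes AdamW from Adam and the moving averages through which past gradients keep shaping future updates.

```python
import math

def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (sketch)."""
    state["t"] += 1
    b1, b2 = betas
    # Momentum: exponential moving averages of the gradient and its
    # square; past gradients persist in m and v across steps.
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    # Decoupled weight decay: shrink the parameter directly instead of
    # adding a decay term to the gradient (the change AdamW makes).
    param = param * (1 - lr * weight_decay)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), state
```

Because m and v are carried in the optimizer state rather than recomputed per batch, a gradient observed on one step continues to push parameters on later steps, which is the mechanism behind the Momentum entry's note about overriding clipping.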