GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by comparing outputs within a group of samples for the same prompt, removing the need for a separate critic model
PPO: Proximal Policy Optimization—an RL algorithm using a clipped surrogate objective and a critic model to stabilize training
Critic Model: A separate neural network in RL that estimates the value (expected return) of a state, used to reduce variance in gradient estimation
Reference Model: A frozen copy of the initial policy used in RL (via a KL divergence penalty) to prevent the trained model from drifting too far from its original behavior
KL divergence: Kullback-Leibler divergence, an asymmetric measure of how one probability distribution differs from another (not a true distance metric, since KL(P||Q) generally differs from KL(Q||P))
SFT: Supervised Fine-Tuning—training on labeled input-output pairs
Advantage Function: A function quantifying how much better a specific action is compared to the average action in that state
AGE: Accurate Gradient Estimation, a technique proposed here to correct gradient scaling when a subset of samples in a group yields zero advantage (all receive the same reward)
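
To make the GRPO, Advantage Function, and KL divergence entries concrete, here is a minimal Python sketch. It assumes the commonly used group-relative formulation, in which each reward is standardized by the mean and standard deviation of its own group; the function names, the `eps` constant, and the example rewards are illustrative and not taken from the source. The sketch also shows the zero-advantage case (all rewards identical) that the AGE entry refers to, but it does not implement AGE itself.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    No critic model is involved; the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# A group with mixed rewards: better-than-average samples get positive
# advantages, worse-than-average samples get negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every sample has the same reward: every advantage is zero,
# so this group contributes no policy-gradient signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# KL divergence is asymmetric: swapping the arguments changes the value.
p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
```

In this sketch `uniform` is all zeros because every reward equals the group mean, while `kl_pq` and `kl_qp` differ, which is why KL divergence is not a symmetric distance.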