RPG: Regularized Policy Gradient—a framework deriving exact gradients for KL-regularized objectives under off-policy sampling
GRPO: Group Relative Policy Optimization—a PPO variant that normalizes advantages within a group of outputs for the same prompt, typically without a value network
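A minimal sketch of the group-relative advantage computation that GRPO uses in place of a learned value baseline. Details such as whether the sample or population standard deviation is used vary by implementation; the population version is assumed here:

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within a group of outputs sampled for one prompt.

    Each output's advantage is its reward minus the group mean, divided
    by the group standard deviation, so no value network is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption here)
    if std == 0:
        # All outputs in the group tied: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

For example, with binary rewards `[1.0, 0.0, 1.0, 0.0]` the advantages come out as `[1.0, -1.0, 1.0, -1.0]`: correct outputs are pushed up and incorrect ones pushed down, relative to the group.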
k3 estimator: A non-negative, low-variance Monte Carlo estimator of KL divergence, k3(y) = y - log y - 1 with y the probability ratio between the reference and current policies; used as the KL penalty term in GRPO-style objectives, and shown by RPG to have the same gradient as the Unnormalized KL (UKL)
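To make the estimator concrete, here is a toy Monte Carlo check on two small categorical distributions (chosen arbitrarily for illustration): averaging k3(p(x)/q(x)) over samples x ~ q estimates KL(q || p), and every term is non-negative:

```python
import math
import random

random.seed(0)

# Two toy categorical distributions over 3 outcomes (illustrative values).
q = [0.5, 0.3, 0.2]   # sampling distribution, e.g. the current policy
p = [0.4, 0.4, 0.2]   # reference distribution

exact_kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def k3(y):
    # k3 estimator: y - log y - 1, with y = p(x)/q(x) for x ~ q.
    # Non-negative for all y > 0, since y - 1 >= log y.
    return y - math.log(y) - 1.0

# Monte Carlo estimate of KL(q || p) from samples of q.
n = 200_000
samples = random.choices(range(3), weights=q, k=n)
estimate = sum(k3(p[x] / q[x]) for x in samples) / n
```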
UKL: Unnormalized KL Divergence—a generalization of KL divergence to distributions that need not sum to 1, with correction terms that keep the divergence non-negative and reduce it to ordinary KL when both distributions are normalized
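A small sketch of the generalized (unnormalized) KL formula, written from its standard definition; the helper name is mine:

```python
import math

def unnormalized_kl(p, q):
    """Generalized (unnormalized) KL divergence over discrete supports:

        D(p || q) = sum_x [ p(x) * log(p(x)/q(x)) - p(x) + q(x) ]

    The extra -p(x) + q(x) terms keep the divergence non-negative even
    when p and q do not sum to 1; when both are normalized, the extra
    terms cancel and this reduces to ordinary KL divergence.
    """
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))
```

With normalized inputs such as `p = [0.5, 0.5]` and `q = [0.25, 0.75]` it agrees with standard KL; with unnormalized inputs it stays non-negative and vanishes only when p equals q.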
Importance Sampling: A technique to estimate properties of a target distribution while sampling from a different (proposal) distribution by weighting samples by the ratio of their probabilities
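The weighting can be shown in a few lines: to estimate E_p[f] while sampling only from a proposal q, each sample x is weighted by p(x)/q(x). The distributions below are arbitrary toy values:

```python
import random

random.seed(0)

# Target p, proposal q, and a function f over outcomes {0, 1, 2}.
p = [0.1, 0.2, 0.7]          # target distribution we cannot sample from
q = [1 / 3, 1 / 3, 1 / 3]    # proposal distribution we can sample from
f = [1.0, 2.0, 3.0]          # quantity whose expectation under p we want

exact = sum(pi * fi for pi, fi in zip(p, f))  # E_p[f]

# Importance-sampling estimate: sample from q, weight by p(x)/q(x).
n = 100_000
samples = random.choices(range(3), weights=q, k=n)
estimate = sum((p[x] / q[x]) * f[x] for x in samples) / n
```

The estimator is unbiased, but its variance grows when p and q differ sharply, which is why off-policy RL methods clip or otherwise control these ratios.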
REINFORCE: A fundamental policy gradient algorithm that updates policies based on the return of complete trajectories
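A minimal sketch of REINFORCE on a two-armed bandit (the simplest "complete trajectory" is a single action). The setup and hyperparameters are illustrative; the key line is the score-function update, logit_i += lr * reward * d log pi(a) / d logit_i:

```python
import math
import random

random.seed(0)

# Two-armed bandit: arm 1 pays reward 1.0, arm 0 pays 0.0.
logits = [0.0, 0.0]
lr = 0.5  # learning rate (an arbitrary illustrative choice)

def probs(logits):
    """Softmax policy over the two arms."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(200):
    pi = probs(logits)
    a = random.choices([0, 1], weights=pi)[0]   # sample an action
    reward = 1.0 if a == 1 else 0.0             # observe the return
    # REINFORCE: for a softmax policy, d log pi(a) / d logit_i
    # is (1[a == i] - pi(i)); scale it by the return.
    for i in range(2):
        grad_logp = (1.0 if a == i else 0.0) - pi[i]
        logits[i] += lr * reward * grad_logp
```

After training, the policy concentrates almost all probability on the rewarding arm.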
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a recent GRPO-style baseline algorithm for large-scale RL training of LLMs
SFT: Supervised Fine-Tuning—training on labeled data before RL
Dual-Clip: A clipping strategy that, in addition to PPO's standard clip, bounds the surrogate objective from below when the advantage is negative, so that very large importance ratios cannot produce arbitrarily large updates
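A per-token sketch of the dual-clip surrogate, with the commonly used defaults eps = 0.2 and c = 3.0 assumed for illustration:

```python
def dual_clip_objective(ratio, advantage, eps=0.2, c=3.0):
    """Dual-clip surrogate objective for one token.

    For positive advantages this is standard PPO clipping:
        min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    For negative advantages the objective is additionally bounded
    from below by c * A (with c > 1), so a very large importance
    ratio cannot drive an arbitrarily large destabilizing update.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    ppo_obj = min(ratio * advantage, clipped * advantage)
    if advantage < 0:
        return max(ppo_obj, c * advantage)
    return ppo_obj
```

For example, with `ratio = 5.0` and `advantage = -1.0`, plain PPO clipping would yield -5.0, but the second clip bounds the objective at c * A = -3.0.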