GRPO: Group Relative Policy Optimization—a policy-gradient method that computes each sample's advantage by standardizing reward scores within a group of completions sampled for the same prompt, removing the need for a learned value-function critic.
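The group-relative advantage can be sketched in a few lines; this is a minimal illustration of the standardization step only (the function name and the zero-variance guard are this sketch's choices, not part of any particular library):

```python
import statistics

def group_advantages(rewards):
    """Standardize rewards within one group of sampled completions.

    Each sample's advantage is its reward minus the group mean,
    divided by the group standard deviation -- so no value-function
    critic is needed to form a baseline.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

advs = group_advantages([1.0, 2.0, 3.0, 4.0])
```

The advantages sum to zero within the group, so above-average completions are reinforced and below-average ones are penalized, relative only to their own group.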
DPO: Direct Preference Optimization—a method that trains the policy directly on preference pairs by exploiting the closed-form solution to the KL-constrained objective, so the reward function is represented only implicitly by the policy.
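A minimal sketch of the DPO loss for a single preference pair, assuming per-completion log-probabilities are already available (the function name and argument names are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    The implicit reward of a completion is beta * (log pi - log pi_ref);
    the loss is -log sigmoid of the chosen-minus-rejected reward margin.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns the chosen completion a larger log-ratio than the rejected one, the margin is positive and the loss drops below log 2; at initialization (policy equal to reference) the loss is exactly log 2.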
Implicit Reward: The reward value implied, up to a scaling factor β and a prompt-dependent constant, by the log-ratio of the current policy probability to the reference policy probability.
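In the notation of the DPO derivation, the implicit reward takes the form:

```latex
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

The term β log Z(x) depends only on the prompt x, so it cancels whenever two completions of the same prompt are compared—this is what lets DPO sidestep computing Z(x).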
Partition Function: A normalizing constant (Z(x)) in probability distributions that sums over all possible outcomes; for LLMs it is usually computationally intractable because the space of completions is exponentially large.
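For the KL-constrained objective discussed here, the partition function takes the form (following the standard DPO-style derivation):

```latex
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)
```

The sum ranges over every possible completion y, which is why Z(x) cannot be computed exactly for an LLM's output space.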
Importance Sampling: A technique to estimate properties of a target distribution while sampling from a different proposal distribution, often using likelihood ratios as weights.
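A self-contained numeric sketch of importance sampling with Gaussians (the helper names and the choice of target/proposal are illustrative, not from the source):

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, sample_q, p_pdf, q_pdf, n=100_000, seed=0):
    """Estimate E_p[f(X)] while sampling only from the proposal q.

    Each draw x ~ q is weighted by the likelihood ratio p(x) / q(x),
    which corrects for sampling from the wrong distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_q(rng)
        total += (p_pdf(x) / q_pdf(x)) * f(x)
    return total / n

# Target p = N(0, 1), proposal q = N(1, 1); E_p[X] should come out near 0
# even though every sample is drawn from a distribution centered at 1.
estimate = importance_estimate(
    f=lambda x: x,
    sample_q=lambda rng: rng.gauss(1.0, 1.0),
    p_pdf=lambda x: normal_pdf(x, 0.0),
    q_pdf=lambda x: normal_pdf(x, 1.0),
)
```

The estimator is unbiased, but its variance grows quickly as the proposal drifts away from the target—the same failure mode that motivates clipping importance ratios in off-policy RL.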
KL-constrained reward maximization: Optimizing a policy to maximize expected reward while keeping the policy distribution close (low Kullback-Leibler divergence) to a reference policy.
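Written out, the KL-constrained objective and its well-known closed-form optimum are:

```latex
\max_{\pi} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi(y \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

```latex
\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)
```

Larger β keeps the optimum closer to the reference policy; inverting this solution for r is what yields the implicit reward above.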
Off-policy training: Training a policy using data generated by a different behavior policy (e.g., an older version of the model or a static dataset).
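A common way to use off-policy data is to reweight each sample by the importance ratio between the current and behavior policies, clipping the ratio PPO-style to limit update size. A minimal per-sample sketch (the function name and clipping constant are this sketch's choices):

```python
import math

def clipped_surrogate(logp_current, logp_behavior, advantage, eps=0.2):
    """Clipped surrogate objective for one off-policy sample.

    The ratio pi_current / pi_behavior corrects for the mismatch between
    the training policy and the (older) behavior policy that generated
    the data; clipping to [1 - eps, 1 + eps] bounds how far a single
    stale sample can push the update.
    """
    ratio = math.exp(logp_current - logp_behavior)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

On-policy data (equal log-probabilities) gives a ratio of 1 and recovers the plain advantage; as the policies diverge, clipping caps the contribution of each sample.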