DPO: Direct Preference Optimization—an algorithm that optimizes a policy to satisfy preferences directly using a classification loss, skipping the explicit reward modeling and RL steps
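The DPO classification loss can be sketched in a few lines. This is a minimal per-example version under stated assumptions: the function name `dpo_loss` and the default `beta=0.1` are illustrative, not from a specific library, and the log-probabilities would come from scoring each response under the policy and reference models.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid of the implicit reward margin.

    Each argument is the summed log-probability of a response under
    either the policy being trained or the frozen reference model.
    """
    # Margin: how much more the policy favors the chosen response over
    # the rejected one, relative to the reference model.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(logits): a standard binary classification loss.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference assign identical margins, the loss is log 2 (maximal uncertainty); it shrinks as the policy favors the chosen response more strongly than the reference does.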
RLHF: Reinforcement Learning from Human Feedback—a method to align models by training a reward model on preferences and then optimizing a policy to maximize that reward
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used in RLHF to update the model policy while preventing it from changing too drastically
Bradley-Terry model: A statistical model that predicts the probability of preferring one item over another based on the difference in their latent 'rewards' or scores
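The Bradley-Terry preference probability is just a sigmoid of the score difference. A minimal sketch (the function name `bt_prob` is illustrative):

```python
import math

def bt_prob(reward_a, reward_b):
    """Bradley-Terry: P(a preferred over b) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
```

Equal scores give a 50/50 preference, and the two orderings' probabilities always sum to 1.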
KL divergence: A measure of how much one probability distribution differs from another; used here to constrain the tuned model to stay close to the original reference model
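For a discrete distribution, KL divergence is a short sum. A sketch, assuming `p` and `q` are aligned lists of probabilities (the function name is illustrative):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), skipping zero-mass terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

It is zero only when the distributions match and grows as they diverge, which is why it works as a penalty keeping the tuned model near the reference.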
SFT: Supervised Fine-Tuning—the initial phase of training on high-quality demonstration data before preference learning begins
partition function: A normalizing constant in probability distributions (Z(x)) that usually makes direct optimization difficult; DPO mathematically cancels this term out
implicit reward: The reward value that is mathematically implied by the optimized policy—β times the difference between the optimized policy's log-probability and the reference policy's log-probability (equivalently, β times the log of their probability ratio)
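The implicit reward is a one-line computation. A sketch under the same illustrative `beta=0.1` assumption as above:

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    i.e., beta times the log of the policy-to-reference probability ratio."""
    return beta * (logp_policy - logp_ref)
```

A response the tuned policy likes more than the reference does gets a positive implicit reward; one it likes less gets a negative one.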