DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference data using a simple classification loss, bypassing the need for a separate reward model
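The "simple classification loss" is a logistic loss on the margin between implicit rewards of the chosen and rejected responses. A minimal sketch for a single preference pair, using plain Python (the function name and the example log-probabilities are illustrative, not from a specific library):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    # Bradley-Terry-style logistic loss on the reward margin.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Illustrative numbers: the policy favors the chosen response more
# than the reference does, so the loss is below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Minimizing this loss pushes the policy to increase the likelihood of chosen responses relative to rejected ones, with `beta` controlling how far it may drift from the reference model.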
RLHF: Reinforcement Learning from Human Feedback—the standard 3-stage alignment pipeline involving SFT, Reward Modeling, and PPO
PPO: Proximal Policy Optimization—an RL algorithm commonly used in RLHF to update the policy based on reward signals
Implicit Reward: The concept in DPO where the reward function is mathematically derived from the optimal policy and reference model, rather than being a separate trained neural network
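Concretely, the implicit reward in DPO takes the form below, where β is the KL-penalty coefficient, π_θ the trainable policy, and π_ref the reference model (notation follows the standard DPO formulation):

```latex
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```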
KL Divergence: A measure of how one probability distribution differs from another (asymmetric, so not a true distance metric), used to penalize the trained model for deviating too far from the reference (base) model
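For discrete distributions, KL divergence is the expectation under p of the log-ratio log(p/q). A minimal sketch in plain Python (the function name is illustrative):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Terms with p_i = 0 contribute nothing by convention.
    """
    return sum(p_i * math.log(p_i / q_i)
               for p_i, q_i in zip(p, q) if p_i > 0)

# Identical distributions diverge by zero; a point mass compared to a
# uniform distribution over two outcomes diverges by log(2).
zero = kl_divergence([0.5, 0.5], [0.5, 0.5])
log2 = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

In RLHF-style training, this quantity is typically estimated per token from the difference between the policy's and the reference model's log-probabilities of the sampled tokens, then added to the objective as a penalty.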
Reference Model: The initial supervised fine-tuned (SFT) model used as a baseline to prevent the optimized model from losing its linguistic capabilities during alignment
Reward Hacking: A phenomenon where the model learns to exploit flaws in the reward signal (e.g., generating very long responses) to get high scores without actually improving quality
Alignment Tax: The degradation of a model's performance on base tasks (e.g., calibration, reasoning) that occurs as a side effect of optimizing for alignment objectives
SFT: Supervised Fine-Tuning—the first stage of training where the model learns to follow instructions from high-quality demonstrations
Bradley-Terry Model: A statistical model that predicts the probability of one item being preferred over another based on their latent reward scores
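Under the Bradley-Terry model, the preference probability is a logistic function of the difference in latent reward scores. A minimal sketch (function name illustrative):

```python
import math

def bt_preference_prob(reward_a, reward_b):
    """P(A preferred over B) under the Bradley-Terry model:
    sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Equal rewards give a 50/50 preference; a 2-point reward gap
# gives roughly an 88% preference for the higher-scored item.
even = bt_preference_prob(1.0, 1.0)
strong = bt_preference_prob(2.0, 0.0)
```

This is the likelihood used to train reward models in RLHF, and it is also the link that lets DPO replace an explicit reward model with the implicit reward defined above.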
Online DPO: Variants of DPO where preference data is generated and labeled iteratively during training, rather than using a static offline dataset