DPO: Direct Preference Optimization—an offline method that optimizes the policy directly on preference data, using a closed-form reward obtained by rewriting the optimal RLHF policy in terms of the policy and a reference model
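For reference, the implicit reward and the pairwise loss this yields, in the standard notation of the DPO paper (β is the KL-regularization strength):

```latex
\[
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\]
\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]
```

The β log Z(x) term cancels in the pairwise difference, which is what makes the objective trainable without ever estimating Z(x).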
SFT: Supervised Fine-Tuning—the initial phase of training a model on high-quality instruction-response pairs before preference alignment
Bradley-Terry model: A statistical model that predicts the probability of one item being preferred over another based on the difference in their underlying rewards or scores
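Concretely, given per-response rewards r(x, y), the Bradley-Terry preference probability is:

```latex
\[
p(y_w \succ y_l \mid x) = \frac{\exp\left(r(x, y_w)\right)}{\exp\left(r(x, y_w)\right) + \exp\left(r(x, y_l)\right)} = \sigma\!\left( r(x, y_w) - r(x, y_l) \right)
\]
```

where σ is the logistic sigmoid; this sigmoid-of-reward-difference form is what the DPO and SimPO losses are built on.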
SimPO: Simple Preference Optimization—the proposed reference-free algorithm, which uses the length-normalized average log-probability of a response as the implicit reward
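A minimal PyTorch sketch of that reward and the resulting loss (a sketch, not the reference implementation: the helper names are mine, and the β and target-margin γ defaults are illustrative values that the SimPO paper tunes per setting):

```python
import torch
import torch.nn.functional as F

def avg_logprob(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Length-normalized average log-probability of the response tokens.

    Assumes logits are already aligned with labels (i.e., shifted by one).
    logits: (B, T, V); labels: (B, T); mask: (B, T), 1 on response tokens.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, dim=2, index=labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1) / mask.sum(dim=-1)

def simpo_loss(chosen_avg_logp, rejected_avg_logp, beta=2.0, gamma=1.0):
    # SimPO objective: -log sigma(beta * r_w - beta * r_l - gamma),
    # where r is the length-normalized average log-probability above.
    margins = beta * chosen_avg_logp - beta * rejected_avg_logp - gamma
    return -F.logsigmoid(margins).mean()
```

Because the reward depends only on the policy's own (length-normalized) log-probabilities, no reference-model forward pass is needed.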
RLHF: Reinforcement Learning from Human Feedback—a generic framework for aligning models using human preference data
Partition function: The normalization factor Z(x) that makes a probability distribution sum to one; because it involves a sum over all possible responses, it is typically intractable to compute directly
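In the DPO-style derivation, Z(x) appears in the optimal policy of the KL-regularized objective:

```latex
\[
\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right), \qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right)
\]
```

The sum ranges over every possible response y, which is why Z(x) cannot be computed directly for sequence models.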
ORPO: Odds Ratio Preference Optimization—a recent reference-free objective that the SimPO paper compares against
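Roughly, ORPO replaces the reference model with a log odds-ratio penalty added to the SFT loss (sketched here from the ORPO paper's formulation; λ is its weighting hyperparameter, and P_θ(y | x) denotes the length-normalized sequence likelihood):

```latex
\[
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}, \qquad
\mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} - \lambda \, \log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right)
\]
```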
Length-controlled win rate: A metric (specifically in AlpacaEval 2) that adjusts win rates to account for the tendency of judges to prefer longer responses regardless of quality