RLHF: Reinforcement Learning from Human Feedback—a method to align AI models by training a reward model on human preferences and optimizing a policy to maximize that reward
DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference data without training a separate reward model
PPO: Proximal Policy Optimization—an on-policy policy-gradient algorithm used in standard RLHF to update the model policy while keeping each update close to the previous policy
ΨPO: Psi-PO—the unified objective function proposed in this paper, where different choices of the function Ψ recover standard RLHF and DPO as special cases, and the identity choice of Ψ yields IPO
Bradley-Terry model: A statistical model predicting the probability that one item is preferred over another as a logistic function of the difference in their latent reward scores, P(i ≻ j) = σ(r_i − r_j)
IPO: Identity Preference Optimization—a DPO variant using a squared error loss to prevent overfitting to deterministic preferences
KTO: Kahneman-Tversky Optimization—an alignment method using binary 'good/bad' signals and a prospect theory-based loss instead of pairwise comparisons
SimPO: Simple Preference Optimization—a reference-free alignment method that uses length-normalized log-probabilities as implicit rewards
ORPO: Odds Ratio Preference Optimization—a method combining supervised fine-tuning and preference alignment into a single stage
SFT: Supervised Fine-Tuning—the initial training phase where a model learns to follow instructions from high-quality demonstrations
implicit KL: The regularization in DPO that is mathematically baked into the loss function rather than added as a separate penalty term
coverage: The extent to which the training data distribution overlaps with the high-reward regions of the response space
Nash Equilibrium: A state in game theory where no player (or policy) can gain by changing their strategy unilaterally; used here for non-transitive preferences
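To make the pairwise objectives above concrete, here is a minimal sketch of the Bradley-Terry preference probability and per-pair DPO and IPO losses. This is illustrative only: the argument names (`logp_w` for the log-probability of the preferred response, `ref_logp_w` for the same quantity under the frozen reference model, and so on) are assumptions for the sketch, not identifiers from any particular codebase.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry(r_w, r_l):
    """Bradley-Terry: probability that the item with reward r_w
    is preferred over the item with reward r_l."""
    return sigmoid(r_w - r_l)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    The log-ratio margin (policy vs. reference) acts as an implicit
    reward difference; beta controls the strength of the implicit
    KL regularization toward the reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO loss for one preference pair: a squared-error regression
    of the log-ratio margin toward 1/(2*tau), which avoids the
    overfitting DPO exhibits on near-deterministic preferences."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2
```

Note how DPO's loss goes to zero only as the margin grows without bound (the overfitting risk noted under IPO above), while IPO's squared loss is minimized at a finite target margin.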