DPO: Direct Preference Optimization—a method to align language models to preferences without training an explicit reward model
Reward Hacking: When an AI system optimizes a proxy reward (here, the learned reward model) at the expense of the true objective
Type I Reward Hacking: When the learned model overestimates the value of a subpar action because of statistical noise in sparse data, so optimization chases spuriously high rewards
Type II Reward Hacking: When the learned model underestimates the value of a good action because of statistical noise, causing the policy to deteriorate from its initialization
Weighted Entropy: An entropy measure that weights outcomes by their importance or coverage, used here to focus learning on well-supported data regions
Dynamic Labels: A technique where training labels are soft-updated based on the model's current confidence to downweight noisy/outlier samples
SimPO: Simple Preference Optimization—a DPO variant that drops the reference policy from the objective, scoring responses by length-normalized log-probability against a target reward margin
IPO: Identity Preference Optimization—a DPO variant that replaces the logistic loss with a squared loss toward a fixed margin, mitigating overfitting to the preference data
SFT: Supervised Fine-Tuning—training on high-quality demonstrations before preference alignment
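To make the relationship between DPO and the variants defined above concrete, here is a minimal sketch of the three per-example losses in plain Python. The hyperparameter values (`beta`, `gamma`, `tau`) are illustrative defaults, not values taken from this document; real implementations compute these losses over batches of token-level log-probabilities.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: logistic loss on the implicit reward margin.

    The implicit reward is beta * (policy log-prob minus reference
    log-prob); no explicit reward model is trained.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: no reference model; length-normalized log-probs
    and a target reward margin gamma."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -math.log(sigmoid(margin))

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO: squared loss pulling the log-ratio gap toward 1/(2*tau),
    which bounds the gap and controls overfitting."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2
```

Note the design difference: DPO and SimPO push the margin toward infinity (overfitting risk on noisy pairs), while IPO's squared loss penalizes both under- and over-shooting a fixed target.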
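The weighted-entropy notion above can be sketched as a one-line generalization of Shannon entropy; the weights (assumed here to encode per-outcome importance or data coverage) recover ordinary entropy when they are all 1.

```python
import math

def weighted_entropy(probs, weights):
    """-sum_i w_i * p_i * log(p_i).

    With all weights equal to 1 this is Shannon entropy; larger weights
    on well-supported outcomes concentrate the measure on regions with
    good data coverage.
    """
    return -sum(w * p * math.log(p)
                for p, w in zip(probs, weights) if p > 0.0)
```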
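One common way to realize the dynamic-labels idea is an exponential moving average of each training label toward the model's current confidence; the `momentum` value below is an illustrative assumption, not a setting from the source. Persistent disagreement between a hard label and the model drags the soft label toward 0.5, shrinking the gradient on likely-mislabeled (noisy/outlier) pairs.

```python
def update_soft_label(prev_label, model_prob, momentum=0.9):
    """Soft-update a preference label toward the model's confidence.

    prev_label: current soft label in [0, 1] (1.0 = hard 'chosen' label)
    model_prob: model's current probability that the labeled winner wins
    momentum:   how slowly the label moves (illustrative default)
    """
    return momentum * prev_label + (1.0 - momentum) * model_prob
```

Used inside a training loop, the updated label replaces the hard 0/1 target in the preference loss, so samples the model consistently disputes contribute progressively weaker gradients.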