DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference data without an explicit reinforcement learning loop
IPO: Identity Preference Optimization—a DPO variant using a squared-error loss that prevents overfitting by regularizing the preference margin toward a fixed target value
MADPO: Margin-Adaptive Direct Preference Optimization—the proposed method that re-weights DPO loss based on explicit reward margins
BTL: Bradley-Terry-Luce—a statistical model giving the probability of one item being preferred over another as a logistic function of their score difference
Implicit Reward Margin: The difference, between the winning and losing responses, of the log-probability ratios of the current policy to the reference policy
Explicit Reward Margin: The reward difference predicted by a separate, trained reward model
beta: A temperature hyperparameter controlling the strength of the KL regularization penalty (or confidence) in the DPO objective
beta-DPO: A DPO variant that adapts the beta parameter at the batch level
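The quantities defined above fit together in a few lines of arithmetic. A minimal sketch, assuming per-response log-probabilities are already summed over tokens (the function names and example values are illustrative, not from the source):

```python
import math

def sigmoid(x):
    """BTL preference probability: logistic function of a score difference."""
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from the implicit reward margin.

    Implicit rewards are the beta-scaled log-probability ratios of the
    current policy to the reference policy.
    """
    r_w = beta * (logp_w - ref_logp_w)   # implicit reward, winning response
    r_l = beta * (logp_l - ref_logp_l)   # implicit reward, losing response
    margin = r_w - r_l                   # implicit reward margin
    # BTL: P(winner preferred) = sigmoid(reward difference); DPO maximizes it.
    return -math.log(sigmoid(margin)), margin

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO's squared-error variant: regress the (unscaled) log-ratio
    margin toward the fixed target 1/(2*tau) instead of pushing it
    to infinity, which curbs overfitting."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2

# Illustrative values: the policy slightly prefers the winning response
# relative to the reference.
loss, margin = dpo_loss(logp_w=-2.0, logp_l=-3.0,
                        ref_logp_w=-2.5, ref_logp_l=-2.5, beta=0.1)
```

With these example numbers the implicit reward margin is positive (0.1), so the DPO loss is below log 2, i.e. the policy already agrees with the preference label.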