
Why DPO is a Misspecified Estimator and How to Fix It

Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee
HP AI Research
arXiv (2025)
RL

📝 Paper Summary

Tags: LLM Alignment · Direct Preference Optimization (DPO) · Reinforcement Learning from Human Feedback (RLHF)
DPO fails for parametric models because it implicitly projects the true reward onto a limited manifold, often causing preference reversals; AuxDPO fixes this by adding auxiliary reward variables.
Core Problem
DPO is derived assuming a tabular policy class with infinite capacity; when applied to parametric models (like neural networks) with finite capacity, it solves a misspecified estimation problem.
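For reference, the estimator in question is the standard DPO objective, sketched below for a single preference pair (a minimal sketch; the variable names and the default `beta` are illustrative, not taken from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) preference pair.

    The implicit reward of a response y is beta * log(pi(y) / pi_ref(y));
    the loss is -log sigmoid of the implicit reward margin between the
    preferred and dispreferred responses.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))
```

With a tabular policy every implicit reward margin is reachable, so this loss recovers the true reward up to the usual shift; with a finite-capacity parametric policy the reachable margins form a restricted set, which is the source of the misspecification.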
Why it matters:
  • DPO can decrease the expected reward of a policy below that of the base model, essentially unlearning alignment, even with infinite clean data.
  • The standard DPO loss leads to pathologies like preference order reversal and extreme sensitivity to the distribution of preference data (e.g., which pairs are compared most often).
  • Two-stage RLHF does not suffer from these specific geometric misspecification issues because it separates reward learning from policy optimization.
Concrete Example: In a simple 3-response scenario where the true reward favors A > B > C, DPO with a linear policy can learn to rank B above A simply because the dataset contains many more A-vs-C comparisons than A-vs-B comparisons, forcing the implicit reward vector into a bad projection.
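This failure mode can be reproduced in a few lines. The sketch below uses illustrative choices (the feature values, pair counts, and step sizes are not the paper's exact setup): a one-parameter softmax policy over three responses is fit with DPO, and because A-vs-C pairs dominate the data, the fitted policy ends up ranking B above A even though every A-vs-B pair says A ≻ B:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# One-parameter linear policy over three responses: logit(y) = theta * phi[y].
# These features are an illustrative choice: no single theta can make the
# implicit reward satisfy both A > B and A > C at the same time.
phi = {"A": 0.0, "B": 1.0, "C": -2.0}
beta = 1.0  # DPO temperature

# True preference order is A > B > C, but A-vs-C comparisons dominate.
pairs = [("A", "C")] * 10 + [("A", "B")] * 1 + [("B", "C")] * 2

def dpo_grad(theta):
    # With a uniform reference policy, the implicit reward margin for a
    # (winner, loser) pair reduces to beta * theta * (phi[w] - phi[l]).
    g = 0.0
    for w, l in pairs:
        diff = phi[w] - phi[l]
        g += -beta * diff * sigmoid(-beta * theta * diff)
    return g

theta = 0.0
for _ in range(2000):  # plain gradient descent on the pooled DPO loss
    theta -= 0.05 * dpo_grad(theta)

z = {y: math.exp(theta * f) for y, f in phi.items()}
policy = {y: v / sum(z.values()) for y, v in z.items()}
# theta settles at a positive value, so policy["B"] > policy["A"]:
# the skewed pair counts force a preference reversal between A and B.
```

Fitting the abundant A-vs-C and B-vs-C pairs pushes `theta` positive, and with these features a positive `theta` necessarily scores B above A, illustrating how the data distribution, not the true reward, decides the learned ordering.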
Key Novelty
Auxiliary Variable Direct Preference Optimization (AuxDPO)
  • Identifies that DPO restricts the learned reward to a specific low-dimensional manifold defined by the policy's gradients.
  • Introduces learnable auxiliary scalar variables for each prompt-response pair in the loss function to decouple the reward modeling capability from the policy's parameter limits.
  • Allows the optimization to find a reward function closer to the 'true' RLHF solution by expanding the feasible reward space, then projecting back to the policy.
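In spirit, the modification can be sketched as follows. This is an illustrative reconstruction, not the paper's exact objective: the per-pair auxiliary scalar `aux` and the quadratic penalty `lam * aux**2` are assumptions standing in for the paper's auxiliary reward variables and whatever regularization keeps them well behaved:

```python
import math

def aux_dpo_loss(logratio_w, logratio_l, aux, beta=0.1, lam=1.0):
    # logratio_* = log(pi(y) / pi_ref(y)) for the winner / loser response.
    # `aux` is a learnable per-pair scalar that absorbs the part of the
    # reward margin the policy's parameters cannot represent, expanding
    # the feasible reward space beyond the policy-gradient manifold.
    margin = beta * (logratio_w - logratio_l) + aux
    return -math.log(1 / (1 + math.exp(-margin))) + lam * aux ** 2
```

With `aux` fixed at zero this reduces to standard DPO; jointly optimizing over the auxiliary variables lets the implied reward leave the low-dimensional manifold described above before being projected back to the policy.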
Evaluation Highlights
  • +8.0% win rate for AuxDPO over DPO on the UltraFeedback dataset with Llama-3-8B-Instruct (win rates judged against the base model).
  • Corrects preference reversals in didactic bandit experiments where standard DPO decreases expected reward below the base policy.
  • Outperforms DPO across varying data regimes, remaining stable even when the preference-pair distribution is heavily skewed.
Breakthrough Assessment
8/10
Provides a rigorous theoretical explanation for known DPO instability (misspecification geometry) and proposes a mathematically grounded fix that works empirically. The insight about data distribution sensitivity is particularly valuable.