
DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory

Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard
Google DeepMind, CEE-M, Montpellier U., Google Research
arXiv (2025)
RL P13N

📝 Paper Summary

Topics: Direct Preference Optimization (DPO) · Reinforcement Learning from Human Feedback (RLHF) · Loss Function Design
DPO's reliance on the Bradley-Terry-Luce model is unnecessary; a broader normative framework allows pairing any preference optimization algorithm with any human choice model, enabling non-convex losses and abstention.
Core Problem
Current preference optimization methods (like DPO) are rigidly tied to the Bradley-Terry-Luce (BTL) model, creating a theoretical 'straitjacket' that constrains algorithmic design and suggests only convex losses are valid.
Why it matters:
  • The assumption that ML algorithms must strictly adhere to specific human choice models (like BTL) limits innovation in loss function design
  • Researchers unknowingly restrict themselves to convex losses (e.g., logistic) because of this theoretical coupling, missing potential gains from non-convex objectives
  • Existing DPO extensions (margins, length normalization) lack a unified normative foundation to explain why they work or how to improve them
Concrete Example: In standard DPO, the reward function is forced to be a log-ratio of policies because of the BTL assumption. If a researcher wants to use a non-convex loss for better robustness, the BTL framework rejects it as theoretically unsound, even if it might perform better empirically.
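To make the coupling concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. The variable names (`policy_chosen_logp`, etc.) are illustrative placeholders for sequence log-probabilities, not the paper's notation; the point is that the BTL assumption fixes both the reward (a log-ratio) and the loss shape (convex logistic).

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    The BTL assumption forces the implicit reward to be the log-ratio
    log pi(y|x) - log pi_ref(y|x), and the training loss to be the
    convex logistic loss on the reward margin.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin) = log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# If the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss is small.
loss = dpo_loss(-3.0, -6.0, -4.0, -5.0)  # margin = 0.1 * (1 - (-1)) = 0.2
```

Swapping `math.log1p(math.exp(-margin))` for any other loss shape is exactly the move that, under a BTL-only reading, looks theoretically unjustified.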
Key Novelty
KLST* Framework (Generalizing DPO via Savage's Theory)
  • Replaces the BTL model with a generalized framework based on Savage's proper losses and Machina's lotteries, allowing for 'abstention' (refusing to choose) in the theoretical model
  • Decouples the loss function from the human choice model, proving that *any* valid analytical choice for training can be embedded with *any* human choice model
  • Unlocks the use of non-convex losses for preference optimization while retaining normative grounding
Evaluation Highlights
  • A toy non-convex loss (mixing exponential and concave shapes) achieves a 54.5% win rate against the standard exponential loss baseline on Alpaca Eval v2
  • Demonstrates that non-convex losses, which the BTL framing previously discouraged on theoretical grounds, can outperform convex baselines when grounded in the new framework
  • The framework theoretically encompasses and validates existing DPO extensions like SimPO (margins) and ODIN (length normalization) as special cases of proper losses
Breakthrough Assessment
9/10
Foundational theoretical work that completely decouples preference optimization from specific choice models. It theoretically validates a vast design space (including non-convex losses) previously thought invalid.