
Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Stanford University
arXiv, May 2023
RL P13N

📝 Paper Summary

LLM Alignment · Reinforcement Learning from Human Feedback (RLHF)
DPO aligns language models to human preferences by solving the constrained reinforcement learning objective in closed form, replacing the complex RL training loop with a simple classification loss on preference pairs.
Core Problem
Existing RLHF methods are complex and unstable: they require a two-stage process (first fitting a reward model to preference data, then training a policy against it via PPO) that involves repeatedly sampling from the LM during training and extensive hyperparameter tuning.
Why it matters:
  • The standard RLHF pipeline is computationally expensive because it requires loading multiple models (policy, reference, reward, value) and sampling during training
  • PPO (Proximal Policy Optimization) is sensitive to hyperparameters and often unstable, leading to model degeneration or mode collapse
  • Precise control of large unsupervised LMs is necessary to avoid undesirable behaviors (e.g., hallucinations, bias) and steer them toward safe, high-quality outputs
Concrete Example: In standard RLHF, to teach a model to be less toxic, you first train a separate reward model to score toxicity, then run an RL loop where the main model generates text, gets scored, and updates. This loop is fragile. DPO skips the separate reward model and RL loop entirely.
Key Novelty
Direct Preference Optimization (DPO)
  • Leverages a mathematical change of variables to express the optimal reward function purely in terms of the optimal policy and a reference policy
  • Reformulates the RLHF objective (maximizing reward with a KL constraint) into a simple binary cross-entropy loss directly on preference pairs
  • Implicitly trains the reward function and the policy simultaneously in a single network, eliminating the need for a separate reward model or RL sampling
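The reparameterization the bullets describe yields a loss that fits in a few lines. The sketch below is an illustrative scalar version (function and argument names are mine, not from the paper's released code); it takes per-sequence log-probabilities of the preferred and dispreferred responses under the trainable policy and the frozen reference model:

```python
import math

def dpo_loss(pi_chosen_lp, pi_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for a single preference pair (scalar sketch)."""
    # Change of variables: the implicit reward of a response is
    # beta * log(pi(y|x) / pi_ref(y|x)), computed here in log-space.
    r_chosen = beta * (pi_chosen_lp - ref_chosen_lp)
    r_rejected = beta * (pi_rejected_lp - ref_rejected_lp)
    margin = r_chosen - r_rejected
    # Binary cross-entropy on the reward margin: -log(sigmoid(margin)).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probs the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response than the reference does, the margin grows and the loss falls. `beta` plays the role of the KL-constraint strength from the original RLHF objective.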
Evaluation Highlights
  • DPO exceeds PPO-based RLHF in controlling sentiment of generations (higher rewards at equivalent KL divergences)
  • Matches or improves response quality in summarization (TL;DR) and single-turn dialogue (Anthropic HH) compared to PPO
  • Substantially simpler to implement and train, removing the need to sample from the policy during the fine-tuning loop
Breakthrough Assessment
10/10
DPO has become the standard alternative to PPO for aligning open-source models due to its simplicity and stability. It theoretically unifies reward modeling and policy optimization.