
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du
University of Washington
arXiv (2025)
RL P13N

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · Direct Preference Optimization (DPO) · Theoretical Analysis of Preference Learning
RLHF and DPO are not equivalent when models are mis-specified or data is limited; RLHF is more sample-efficient for sparse rewards, while DPO is robust against reward mis-specification but struggles with policy mis-specification.
Core Problem
Common theoretical analyses assume 'realizability' (the true reward/policy lies in the model class), under which RLHF and DPO are provably equivalent; this assumption fails in practice because of limited model capacity and data.
Why it matters:
  • Practitioners often observe performance differences between RLHF and DPO that standard theory cannot explain
  • Reward models are often much smaller than policy models (e.g., 6B vs 175B), creating significant representation gaps
  • Understanding when to use DPO vs. RLHF is critical for efficient LLM alignment under resource constraints
Concrete Example: When the ground-truth reward depends on only a few sparse features (e.g., a handful of keywords matter) but data is limited, DPO requires significantly more samples than RLHF, which can exploit the sparsity during its reward-modeling stage.
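This sparsity gap can be illustrated with a minimal numpy sketch (the constants and setup below are hypothetical choices for illustration, not taken from the paper): it simulates Bradley-Terry preference data from a k-sparse reward and compares plain gradient descent against L1-regularized proximal gradient descent (ISTA), the latter standing in for a reward-modeling stage that exploits sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 200, 3, 150          # ambient dim, sparsity, sample size (hypothetical)
theta_true = np.zeros(d)
theta_true[:k] = 1.0           # only k coordinates of the reward matter

# Preference data: each pair is a feature difference z; the label is
# drawn from the Bradley-Terry model sigma(theta . z).
Z = rng.normal(size=(n, d))
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
y = rng.random(n) < sigmoid(Z @ theta_true)

def fit(lam, steps=2000, lr=0.1):
    """Logistic regression via proximal gradient descent; lam > 0 adds
    an L1 penalty (soft-thresholding step), mimicking an estimator
    that exploits sparsity of the ground-truth reward."""
    theta = np.zeros(d)
    for _ in range(steps):
        grad = Z.T @ (sigmoid(Z @ theta) - y) / n
        theta = theta - lr * grad
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)
    return theta

dense = fit(lam=0.0)    # no sparsity prior: error grows with ambient dimension d
sparse = fit(lam=0.05)  # L1 prior: irrelevant coordinates are shrunk to zero
print("dense  error:", np.linalg.norm(dense - theta_true))
print("sparse error:", np.linalg.norm(sparse - theta_true))
```

The soft-thresholding step zeroes coordinates whose accumulated gradient stays small, so the sparse fit concentrates its samples on the k relevant features rather than spreading error across all d dimensions.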
Key Novelty
Decomposition of Performance Gap into Representation and Statistical Sources
  • Analyzes 'Exact Optimization' (infinite data) to show how mis-specification in reward vs. policy classes determines whether RLHF or DPO is superior
  • Analyzes 'Approximate Optimization' (finite data) to prove RLHF has a statistical advantage (sample efficiency) over DPO when the ground-truth reward is sparse
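The two pipelines being contrasted can be made concrete with their standard objectives: RLHF first fits an explicit reward model with the Bradley-Terry loss, while DPO applies the same loss directly to the policy's implicit reward beta * log(pi/pi_ref). The formulas below are the standard ones; the toy batch is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for RLHF's reward-modeling stage:
    -log sigma(r(x, y_w) - r(x, y_l)), averaged over pairs."""
    return -np.mean(np.log(sigmoid(r_chosen - r_rejected)))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: the policy's implicit reward is beta * log(pi/pi_ref),
    so the Bradley-Terry loss is applied directly to policy log-probs,
    bypassing the explicit reward model entirely."""
    implicit_w = beta * (logp_w - ref_logp_w)
    implicit_l = beta * (logp_l - ref_logp_l)
    return -np.mean(np.log(sigmoid(implicit_w - implicit_l)))

# Toy batch of preference pairs (hypothetical numbers).
rng = np.random.default_rng(0)
r_w, r_l = rng.normal(1.0, 1.0, 8), rng.normal(0.0, 1.0, 8)
print("reward-model loss:", reward_model_loss(r_w, r_l))
lp_w, lp_l = rng.normal(-5, 1, 8), rng.normal(-6, 1, 8)
print("DPO loss:", dpo_loss(lp_w, lp_l, lp_w - 0.1, lp_l + 0.1))
```

Under realizability the two objectives pick out the same optimal policy; the paper's point is that once the reward class or policy class is mis-specified, minimizing one is no longer equivalent to minimizing the other.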
Evaluation Highlights
  • RLHF recovers the optimal policy with error O(√(k log d / n)) for k-sparse rewards in dimension d with n samples, whereas DPO's error scales as O(d/n), making RLHF significantly more sample-efficient in this regime
  • Under policy mis-specification (optimal policy not in model class), RLHF yields a strictly superior policy to DPO even with infinite data
  • Under reward mis-specification (true reward not in model class), DPO is strictly superior because it bypasses the flawed reward modeling stage
Breakthrough Assessment
8/10
Provides a rigorous theoretical foundation for the empirical folk wisdom about when DPO or RLHF works better, challenging the common assumption that the two are equivalent.