
The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun
Carnegie Mellon University, Cornell University
Neural Information Processing Systems (2024)

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · Preference Fine-tuning
The paper proves that offline contrastive methods like DPO require a stricter global coverage condition on the preference data to succeed, whereas online RLHF needs only local coverage, motivating a new hybrid algorithm (HyPO) that combines the strengths of both.
Core Problem
Offline contrastive methods (like DPO) are often treated as theoretically equivalent to online RLHF, but empirically, online methods frequently outperform them, suggesting a missing theoretical distinction.
Why it matters:
  • Recent empirical studies show online methods consistently beating offline ones, contradicting early claims of equivalence
  • Purely offline methods can fail catastrophically when the preference dataset lacks diversity (poor coverage) relative to the responses the optimal policy would generate
  • Understanding this gap is crucial for designing algorithms that are both computationally efficient (like DPO) and performant (like PPO)
Concrete Example: In a scenario where the offline dataset contains only sub-optimal responses (poor coverage), DPO might mistakenly increase the likelihood of sub-optimal actions or fail to converge to the optimal policy, whereas online RLHF can correct itself by sampling and evaluating new actions.
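This failure mode can be reproduced in a minimal toy setting (a sketch of the phenomenon, not the paper's formal construction): a three-action bandit where action a0 is optimal but absent from the offline data, which contains only the comparison a1 ≻ a2 between sub-optimal actions. Because the DPO margin depends only on the logit gap between the compared actions, gradient descent widens that gap without bound, and normalization drains probability mass from the unseen optimal action.

```python
import math

# Toy 3-action bandit (illustrative sketch, not the paper's construction).
# a0 is optimal but never appears in the offline preference data;
# the data contains only the comparison a1 > a2 between sub-optimal actions.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def dpo_loss(logits, ref_logits, winner, loser, beta=1.0):
    """DPO loss: -log sigmoid(beta * (log-ratio(winner) - log-ratio(loser)))."""
    pi, ref = softmax(logits), softmax(ref_logits)
    margin = beta * (math.log(pi[winner] / ref[winner])
                     - math.log(pi[loser] / ref[loser]))
    return math.log(1.0 + math.exp(-margin))

ref_logits = [0.0, 0.0, 0.0]   # uniform reference policy
logits = [0.0, 0.0, 0.0]
lr, eps = 0.5, 1e-5

# Finite-difference gradient descent on the single offline pair (a1 > a2).
for _ in range(200):
    base = dpo_loss(logits, ref_logits, 1, 2)
    grads = []
    for i in range(3):
        bumped = list(logits)
        bumped[i] += eps
        grads.append((dpo_loss(bumped, ref_logits, 1, 2) - base) / eps)
    logits = [x - lr * g for x, g in zip(logits, grads)]

probs = softmax(logits)
# The loss depends only on the gap between l1 and l2, so that gap grows
# without bound and the unseen optimal action a0 loses mass via normalization.
print([round(p, 3) for p in probs])
```

Running the sketch shows a1's probability climbing, a2's collapsing, and a0's mass shrinking well below its initial 1/3, even though a0 is the optimal action: nothing in the offline objective constrains it.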
Key Novelty
Theoretical Separation via Coverage & Hybrid Optimization (HyPO)
  • Establishes a theoretical hierarchy: offline methods (DPO) need 'global coverage' (the dataset covers the responses of every candidate policy), while online methods (RLHF) only need 'local coverage' (the dataset covers the responses of the optimal policy)
  • Demonstrates that offline methods cannot guarantee control over the reverse KL divergence (the policy drifting too far from the reference model) when coverage is only partial
  • Proposes HyPO: A method using offline data for contrastive learning (efficiency) while using online unlabeled data to enforce KL constraints (robustness)
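A minimal sketch of the hybrid idea in the same toy bandit setting (hypothetical illustration, not the authors' implementation: the names `hybrid_objective` and `lam` are mine, and the exact reverse KL computed below stands in for the estimate HyPO forms from unlabeled online generations of the current policy):

```python
import math

# Same toy 3-action bandit: a0 optimal but unseen; offline data says only a1 > a2.
# HyPO-style objective (sketch): DPO loss on the offline pair plus a reverse-KL
# penalty to the reference. In this toy the KL is computed exactly; in HyPO it
# is estimated from unlabeled online samples drawn from the current policy.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def dpo_loss(logits, ref_logits, winner, loser, beta=1.0):
    pi, ref = softmax(logits), softmax(ref_logits)
    margin = beta * (math.log(pi[winner] / ref[winner])
                     - math.log(pi[loser] / ref[loser]))
    return math.log(1.0 + math.exp(-margin))

def reverse_kl(logits, ref_logits):
    """KL(pi || pi_ref) -- the drift quantity HyPO keeps small."""
    pi, ref = softmax(logits), softmax(ref_logits)
    return sum(p * math.log(p / r) for p, r in zip(pi, ref))

def hybrid_objective(logits, ref_logits, lam=1.0):
    # Offline contrastive term + online-estimable KL regularizer.
    return dpo_loss(logits, ref_logits, 1, 2) + lam * reverse_kl(logits, ref_logits)

ref_logits = [0.0, 0.0, 0.0]
logits = [0.0, 0.0, 0.0]
lr, eps = 0.5, 1e-5

for _ in range(300):
    base = hybrid_objective(logits, ref_logits)
    grads = []
    for i in range(3):
        bumped = list(logits)
        bumped[i] += eps
        grads.append((hybrid_objective(bumped, ref_logits) - base) / eps)
    logits = [x - lr * g for x, g in zip(logits, grads)]

probs = softmax(logits)
# The KL penalty bounds the drift: the preferred action a1 still beats a2,
# but the unseen optimal action a0 retains mass close to the reference level.
print([round(p, 3) for p in probs], round(reverse_kl(logits, ref_logits), 3))
```

In contrast to the pure-DPO toy run, the KL term keeps the policy anchored near the reference, so the unseen action's probability stays bounded away from zero while the observed preference is still respected.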
Evaluation Highlights
  • HyPO outperforms DPO on the TL;DR summarization task, achieving a higher GPT-4 win rate (52.2% vs 46.5%) against the reference
  • HyPO maintains much lower reverse KL divergence to the reference policy compared to DPO (approx. 20 vs >100) while achieving higher rewards
  • On AlpacaEval 2.0 with UltraFeedback, HyPO exceeds DPO's length-controlled win rate (23.3% vs 21.0%)
Breakthrough Assessment
8/10
Provides a rigorous theoretical explanation for a widely observed empirical phenomenon (PPO > DPO) and successfully translates this theory into a practical, improved algorithm.