
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu
Zhejiang University, China; Nanyang Technological University, Singapore; Alibaba Group, China
arXiv (2024)
Tags: RL · MM · Reasoning · Factuality · Benchmark

📝 Paper Summary

Topics: LLM Alignment · Preference Optimization
This survey provides a comprehensive taxonomy and review of Direct Preference Optimization (DPO), categorizing theoretical challenges, algorithmic variants, datasets, and applications to guide future alignment research.
Core Problem
RLHF (Reinforcement Learning from Human Feedback) is computationally expensive, unstable, and complex because it requires training a separate reward model and then optimizing the policy against it with PPO.
Why it matters:
  • RLHF requires meticulous hyperparameter tuning and extensive resources to maintain training stability
  • Explicit reward modeling in RLHF suffers from issues like reward hacking, misspecification, and poor out-of-distribution generalization
  • A lack of structured review on DPO limits the community's ability to identify emerging trends and address DPO's own limitations (e.g., alignment tax, biased policies)
Concrete Example: In standard RLHF, optimizing a policy requires loading a policy model, a reference model, a reward model, and a critic model into memory simultaneously. DPO simplifies this by optimizing the policy directly on preference data using a binary cross-entropy loss, eliminating the need for the explicit reward model and PPO loop.
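The binary cross-entropy loss mentioned above can be sketched for a single preference pair. This is a minimal illustration, not the paper's implementation; the function name and the scalar log-probability inputs are assumptions.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model; beta scales the
    implicit KL penalty that keeps the policy near the reference.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy favors the chosen
    # response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, margin = 0 and the
# loss is log(2) ≈ 0.693.
```

Note that only two models (policy and reference) appear, versus the four kept in memory for PPO-based RLHF.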
Key Novelty
Structured Taxonomy of DPO Research
  • Categorizes DPO research into key questions: implicit reward modeling effects, KL penalty analysis, feedback types (pairwise vs. listwise), and online vs. offline dynamics
  • Compiles a comprehensive list of human-labeled and AI-labeled preference datasets specifically for DPO training
  • Reviews diverse applications beyond standard chat, including reasoning, hallucination reduction, and multi-modal generation
Evaluation Highlights
  • Identifies over 30 DPO variants (e.g., KTO, IPO, ORPO) that address specific limitations like overfitting or data scarcity
  • Catalogs over 20 preference datasets, distinguishing between human-labeled (e.g., HH-RLHF, HelpSteer) and AI-labeled (e.g., UltraFeedback, RLAIF-V) sources
  • Highlights the shift towards 'Online DPO' and iterative methods to close the performance gap between offline DPO and online RLHF
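The offline/online distinction above comes down to where the preference pairs originate. A toy sketch of one online collection round follows; `toy_judge` and `policy_sample` are hypothetical stand-ins for a real AI annotator and a real sampling policy.

```python
def toy_judge(prompt, a, b):
    """Hypothetical AI-feedback judge: prefers the longer response.
    A stand-in for a reward model or LLM annotator."""
    return (a, b) if len(a) >= len(b) else (b, a)

def collect_online_pairs(policy_sample, prompts, judge):
    """One round of online preference collection: sample two responses
    per prompt from the *current* policy and label them with the judge,
    so the training data tracks the policy's own distribution."""
    pairs = []
    for p in prompts:
        a, b = policy_sample(p), policy_sample(p)
        chosen, rejected = judge(p, a, b)
        pairs.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return pairs  # fed into a DPO update, then the loop repeats
```

In offline DPO the pairs instead come from a fixed pre-collected dataset; the iterative methods the survey highlights re-run this collection step between DPO updates to narrow the gap to online RLHF.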
Breakthrough Assessment
8/10
Highly valuable as a foundational reference. While it is a survey and does not propose a new algorithm, its structured taxonomy and extensive coverage of datasets/variants make it a critical resource for the field.