
Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

Paul Gölz, Nika Haghtalab, Kunhe Yang
Cornell University
arXiv.org (2025)

📝 Paper Summary

AI Alignment Theory · Social Choice Theory in AI · Reinforcement Learning from Human Feedback (RLHF)
Theoretical analysis proves that standard alignment methods like RLHF and DPO can drastically fail to satisfy diverse user populations, whereas Nash Learning from Human Feedback guarantees near-optimal average utility.
Core Problem
State-of-the-art alignment methods (RLHF, DPO) aggregate diverse human preferences into a single 'mythical' reward model, which may fail to maximize the actual average utility across a heterogeneous population.
Why it matters:
  • Current methods may align with a majority group's preferences while ignoring minorities, leading to unfair outcomes
  • There is no theoretical guarantee that optimizing for a single representative proxy user actually improves the average satisfaction of real, diverse users
  • Blindly following ordinal preferences (A > B) without considering cardinal utility strength leads to suboptimal policy decisions
Concrete Example: Consider a scenario where a minority group strongly dislikes an output while a majority slightly prefers it. RLHF, acting like a Borda count voting rule, might select the output because it wins more pairwise comparisons, drastically lowering the population's average utility compared to a compromise option that everyone finds acceptable.
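This failure mode can be sketched in a few lines of Python with made-up cardinal utilities (the numbers below are illustrative, not from the paper): the ordinal pairwise rule picks the majority's slight favorite even though the compromise output has higher average utility.

```python
# Toy population with made-up utilities in [0, 1]: two users slightly
# prefer output A; one user strongly dislikes A and prefers B.
utilities = [
    {"A": 0.6, "B": 0.5},  # majority member
    {"A": 0.6, "B": 0.5},  # majority member
    {"A": 0.0, "B": 0.9},  # minority member
]

def pairwise_winner(pop, x, y):
    """Ordinal (Borda-style) rule: pick the winner of more pairwise comparisons."""
    x_wins = sum(u[x] > u[y] for u in pop)
    return x if x_wins > len(pop) / 2 else y

def avg_utility(pop, option):
    """Cardinal objective: average utility across the population."""
    return sum(u[option] for u in pop) / len(pop)

print(pairwise_winner(utilities, "A", "B"))  # A wins 2 of 3 comparisons
print(avg_utility(utilities, "A"))           # 0.4
print(avg_utility(utilities, "B"))           # ~0.63: the better compromise
```

The ordinal rule sees only "2 of 3 prefer A" and discards the minority's strong objection, which is exactly the information the cardinal average captures.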
Key Novelty
Distortion of Alignment Framework
  • Adapts 'distortion' from social choice theory to quantify the worst-case ratio between the optimal achievable utility (if true preferences were known) and the utility achieved by an alignment method
  • Models users via individual Bradley-Terry models rather than a single ground truth, acknowledging that preference noise comes from population heterogeneity, not just sampling error
  • Analytically proves that Nash Learning from Human Feedback (NLHF) minimizes this distortion, acting as a 'Maximal Lotteries' voting rule that is robust to diverse preferences
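The 'Maximal Lotteries' view can be made concrete on a toy Condorcet cycle (all margins below are made up): a maximal lottery is a distribution over outputs that is an optimal strategy of the symmetric zero-sum game whose payoff matrix holds the pairwise preference margins, so it never loses in expectation against any alternative.

```python
# Skew-symmetric margin matrix over outputs A, B, C (made-up numbers):
# margins[i][j] = (fraction preferring i over j) - (fraction preferring j over i).
# A beats B, B beats C, C beats A -- a Condorcet cycle, so no single winner.
margins = [
    [ 0.0,  0.2, -0.4],
    [-0.2,  0.0,  0.6],
    [ 0.4, -0.6,  0.0],
]

def expected_margins(p, M):
    """Expected margin of lottery p against each pure alternative j."""
    n = len(M)
    return [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]

# Solving the equalizing conditions by hand for this matrix gives
# p = (1/2, 1/3, 1/6): its expected margin against every alternative is 0
# (the value of a symmetric zero-sum game), so it is the maximal lottery.
maximal_lottery = [1/2, 1/3, 1/6]
print(expected_margins(maximal_lottery, margins))  # all ~0
```

For larger alternative sets the same lottery can be found with a linear program; the hand-solved equalities above just keep the sketch self-contained.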
Evaluation Highlights
  • Nash Learning from Human Feedback (NLHF) achieves minimax-optimal distortion of (1/2 + o(1))β, matching the best worst-case utility guarantee any alignment method can attain, regardless of population diversity
  • RLHF and Direct Preference Optimization (DPO) suffer distortion e^(Ω(β)) in the alignment setting, meaning their achieved utility can be exponentially far from optimal as the preference-strength parameter β grows
  • Standard RLHF is shown to be equivalent to the Borda count voting rule, which has bounded distortion O(β²) only in the unconstrained social choice setting; the guarantee breaks down once a KL constraint to the reference policy is imposed
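A hedged sketch of why distortion grows with β: if each user follows their own Bradley-Terry model, the pooled comparison probabilities saturate as β grows, so the signal a single reward model sees converges to a purely ordinal (Borda-like) vote count and stops reflecting how strongly the minority objects. All utilities and β values below are made up for illustration.

```python
import math

def bt_prob(u_x, u_y, beta):
    """Bradley-Terry probability that a user with cardinal utilities
    u_x, u_y reports preferring x over y; beta scales preference strength."""
    return 1.0 / (1.0 + math.exp(-beta * (u_x - u_y)))

# Made-up population: majority slightly prefers A, minority strongly prefers B.
population = [
    {"A": 0.6, "B": 0.5},
    {"A": 0.6, "B": 0.5},
    {"A": 0.0, "B": 0.9},
]

def pooled_win_rate(pop, x, y, beta):
    """What a single reward model fit to pooled comparisons sees:
    the population-average probability that x beats y."""
    return sum(bt_prob(u[x], u[y], beta) for u in pop) / len(pop)

def avg_utility(pop, option):
    return sum(u[option] for u in pop) / len(pop)

# At small beta, the minority's strong dislike still shows up in the pooled
# signal; at large beta, the sigmoids saturate and A's win rate tends to the
# 2/3 of users who prefer it, even though avg utility favors B (0.4 vs ~0.63).
print(pooled_win_rate(population, "A", "B", 1.0))   # below 0.5
print(pooled_win_rate(population, "A", "B", 50.0))  # above 0.5
```

The same population thus yields opposite pooled winners at different β, while the utility-optimal choice (B) never changes: a small numerical instance of the exponential gap between the ordinal signal and the cardinal objective.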
Breakthrough Assessment
9/10
Provides a rigorous theoretical foundation exposing a fundamental flaw in the dominant RLHF paradigm (aggregating diverse users into one reward model) and proves why game-theoretic approaches like NLHF are mathematically superior for pluralistic alignment.