
Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

Paul Gölz, Nika Haghtalab, Kunhe Yang
Cornell University
arXiv.org (2025)

📝 Paper Summary

AI Alignment Theory · Social Choice Theory in AI · Reinforcement Learning from Human Feedback (RLHF)
Theoretical analysis proves that standard alignment methods like RLHF and DPO can drastically fail to satisfy diverse user populations, whereas Nash Learning from Human Feedback guarantees near-optimal average utility.
Core Problem
State-of-the-art alignment methods (RLHF, DPO) aggregate diverse human preferences into a single 'mythical' reward model, which may fail to maximize the actual average utility across a heterogeneous population.
Why it matters:
  • Current methods may align with a majority group's preferences while ignoring minorities, leading to unfair outcomes
  • There is no theoretical guarantee that optimizing for a single representative proxy user actually improves the average satisfaction of real, diverse users
  • Blindly following ordinal preferences (A > B) without considering cardinal utility strength leads to suboptimal policy decisions
Concrete Example: Consider a scenario where a minority group strongly dislikes an output while a majority slightly prefers it. RLHF, acting like a Borda count voting rule, might select the output because it wins more pairwise comparisons, drastically lowering the population's average utility compared to a compromise option that everyone finds acceptable.
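This failure mode can be sketched in a few lines of Python with made-up cardinal utilities (the numbers below are illustrative, not from the paper): the ordinal pairwise rule picks the majority's slight favorite even though the compromise output has higher average utility.

```python
# Toy population with made-up utilities in [0, 1]: two users slightly
# prefer output A; one user strongly dislikes A and prefers B.
utilities = [
    {"A": 0.6, "B": 0.5},  # majority member
    {"A": 0.6, "B": 0.5},  # majority member
    {"A": 0.0, "B": 0.9},  # minority member
]

def pairwise_winner(pop, x, y):
    """Ordinal (Borda-style) rule: pick the winner of more pairwise comparisons."""
    x_wins = sum(u[x] > u[y] for u in pop)
    return x if x_wins > len(pop) / 2 else y

def avg_utility(pop, option):
    """Cardinal objective: average utility across the population."""
    return sum(u[option] for u in pop) / len(pop)

print(pairwise_winner(utilities, "A", "B"))  # A wins 2 of 3 comparisons
print(avg_utility(utilities, "A"))           # 0.4
print(avg_utility(utilities, "B"))           # ~0.63: the better compromise
```

The ordinal rule sees only "2 of 3 prefer A" and discards the minority's strong objection, which is exactly the information the cardinal average captures.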
Key Novelty
Distortion of Alignment Framework
  • Adapts 'distortion' from social choice theory to quantify the worst-case ratio between the optimal achievable utility (if true preferences were known) and the utility achieved by an alignment method
  • Models users via individual Bradley-Terry models rather than a single ground truth, acknowledging that preference noise comes from population heterogeneity, not just sampling error
  • Analytically proves that Nash Learning from Human Feedback (NLHF) minimizes this distortion, acting as a 'Maximal Lotteries' voting rule that is robust to diverse preferences
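The 'Maximal Lotteries' view can be made concrete on a toy Condorcet cycle (all margins below are made up): a maximal lottery is a distribution over outputs that is an optimal strategy of the symmetric zero-sum game whose payoff matrix holds the pairwise preference margins, so it never loses in expectation against any alternative.

```python
# Skew-symmetric margin matrix over outputs A, B, C (made-up numbers):
# margins[i][j] = (fraction preferring i over j) - (fraction preferring j over i).
# A beats B, B beats C, C beats A -- a Condorcet cycle, so no single winner.
margins = [
    [ 0.0,  0.2, -0.4],
    [-0.2,  0.0,  0.6],
    [ 0.4, -0.6,  0.0],
]

def expected_margins(p, M):
    """Expected margin of lottery p against each pure alternative j."""
    n = len(M)
    return [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]

# Solving the equalizing conditions by hand for this matrix gives
# p = (1/2, 1/3, 1/6): its expected margin against every alternative is 0
# (the value of a symmetric zero-sum game), so it is the maximal lottery.
maximal_lottery = [1/2, 1/3, 1/6]
print(expected_margins(maximal_lottery, margins))  # all ~0
```

For larger alternative sets the same lottery can be found with a linear program; the hand-solved equalities above just keep the sketch self-contained.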
Evaluation Highlights
  • Nash Learning from Human Feedback (NLHF) achieves minimax-optimal distortion of (1/2 + o(1))β, matching the best worst-case utility guarantee any alignment method can attain, regardless of population diversity
  • RLHF and Direct Preference Optimization (DPO) suffer distortion e^(Ω(β)) in the alignment setting, meaning their achieved utility can be exponentially far from optimal as the preference-strength parameter β grows
  • Standard RLHF is shown to be equivalent to the Borda count voting rule, which has bounded distortion O(β²) only in the unconstrained social choice setting; the guarantee breaks down once a KL constraint to the reference policy is imposed
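A hedged sketch of why distortion grows with β: if each user follows their own Bradley-Terry model, the pooled comparison probabilities saturate as β grows, so the signal a single reward model sees converges to a purely ordinal (Borda-like) vote count and stops reflecting how strongly the minority objects. All utilities and β values below are made up for illustration.

```python
import math

def bt_prob(u_x, u_y, beta):
    """Bradley-Terry probability that a user with cardinal utilities
    u_x, u_y reports preferring x over y; beta scales preference strength."""
    return 1.0 / (1.0 + math.exp(-beta * (u_x - u_y)))

# Made-up population: majority slightly prefers A, minority strongly prefers B.
population = [
    {"A": 0.6, "B": 0.5},
    {"A": 0.6, "B": 0.5},
    {"A": 0.0, "B": 0.9},
]

def pooled_win_rate(pop, x, y, beta):
    """What a single reward model fit to pooled comparisons sees:
    the population-average probability that x beats y."""
    return sum(bt_prob(u[x], u[y], beta) for u in pop) / len(pop)

def avg_utility(pop, option):
    return sum(u[option] for u in pop) / len(pop)

# At small beta, the minority's strong dislike still shows up in the pooled
# signal; at large beta, the sigmoids saturate and A's win rate tends to the
# 2/3 of users who prefer it, even though avg utility favors B (0.4 vs ~0.63).
print(pooled_win_rate(population, "A", "B", 1.0))   # below 0.5
print(pooled_win_rate(population, "A", "B", 50.0))  # above 0.5
```

The same population thus yields opposite pooled winners at different β, while the utility-optimal choice (B) never changes: a small numerical instance of the exponential gap between the ordinal signal and the cardinal objective.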
Breakthrough Assessment
9/10
Provides a rigorous theoretical foundation exposing a fundamental flaw in the dominant RLHF paradigm (aggregating diverse users into one reward model) and proves why game-theoretic approaches like NLHF are mathematically superior for pluralistic alignment.