
Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
University of California, Berkeley (Berkeley AI Safety Initiative), AWS Generative AI Innovation Center, Meta AI, Stanford University, Northeastern University
arXiv (2025)

📝 Paper Summary

Keywords: AI Alignment, Reinforcement Learning from Human Feedback (RLHF), AI Safety
The Alignment Trilemma proves that RLHF systems cannot simultaneously achieve full representativeness of diverse human values, polynomial computational tractability, and robustness against adversarial shifts.
Core Problem
Current RLHF relies on small, homogeneous datasets to remain computationally tractable, which mathematically necessitates sacrificing either the diversity of human values (representativeness) or safety against attacks (robustness).
Why it matters:
  • Models serving global populations (180+ countries) are trained on narrow 'WEIRD' data (Western, Educated, Industrialized, Rich, Democratic), erasing minority perspectives
  • Attempts to fix bias often reduce robustness, while improving robustness amplifies majority biases, leading to 'sycophancy' where models agree with user errors to maximize reward
  • Scaling compute yields diminishing returns due to a proven 'scaling wall' where complexity grows super-polynomially with context dimension
Concrete Example: A response considered 'helpful' (direct) in San Francisco is rated 'harmful' (impolite) in Tokyo. Capturing both views creates a noisy reward model (intractable); regularizing for tractability collapses the model to the majority view (erasing Tokyo); preserving the conflict makes the model vulnerable to adversarial inputs.
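The trade-off in this example can be sketched numerically. Below is a toy simulation; the group names, preference rates, and simple win-counting are my own illustrative assumptions, not the paper's experimental setup:

```python
# Toy sketch (illustrative only): two annotator groups hold opposing
# preferences over the same (response_A, response_B) pair, mirroring the
# San Francisco vs. Tokyo example. All numbers are hypothetical.
import random

random.seed(0)

# Probability that an annotator from each group prefers the "direct" response A.
GROUP_PREF = {"sf": 0.9, "tokyo": 0.1}

def sample_label(group: str) -> int:
    """1 if the sampled annotator prefers response A, else 0."""
    return 1 if random.random() < GROUP_PREF[group] else 0

def empirical_pref(groups: list[str], n: int = 10_000) -> float:
    """Fraction of comparisons preferring A under a mixed annotator pool."""
    wins = sum(sample_label(random.choice(groups)) for _ in range(n))
    return wins / n

p_mixed = empirical_pref(["sf", "tokyo"])  # ~0.5: maximally noisy reward signal
p_sf_only = empirical_pref(["sf"])         # ~0.9: clean signal, but one group erased
print(p_mixed, p_sf_only)
```

Keeping both groups drives the preference label toward a coin flip (a noisy, hard-to-fit reward), while dropping one group yields a clean but unrepresentative signal, which is the representativeness/tractability tension in miniature.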
Key Novelty
The Alignment Trilemma
  • Formalizes three conflicting goals: capturing diverse values (epsilon-representativeness), efficient training (polynomial tractability), and safety (delta-robustness)
  • Proves a 'Scaling Wall': achieving both fairness and safety for global populations requires operations exponential in the context size, akin to P vs NP hardness
  • Reframes common RLHF failures (bias, hallucinations) not as bugs to be patched, but as unavoidable consequences of choosing tractability over the other two axes
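One plausible way to write down the three axes above, in my own notation (the symbols, subgroup set, and perturbation radius are assumptions consistent with the summary's terms, not definitions lifted from the paper):

```latex
% Hypothetical notation: \hat{r} is the learned reward model, r_g the true
% reward of subgroup g \in \mathcal{G}, d the context dimension.
\begin{align*}
\textbf{(R)}\; & \varepsilon\text{-representativeness:} &
  \sup_{g \in \mathcal{G}}\, \mathbb{E}_x\big[\,|\hat{r}(x) - r_g(x)|\,\big] &\le \varepsilon \\
\textbf{(T)}\; & \text{polynomial tractability:} &
  \mathrm{TrainCost}(\hat{r}) &\le \mathrm{poly}(d) \\
\textbf{(D)}\; & \delta\text{-robustness:} &
  \sup_{\|x - x'\| \le \gamma}\, |\hat{r}(x) - \hat{r}(x')| &\le \delta
\end{align*}
```

Under this reading, the trilemma asserts that any \(\hat{r}\) satisfying (R) and (D) jointly for a globally diverse \(\mathcal{G}\) forces training cost to \(\Omega(2^d)\), violating (T).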
Evaluation Highlights
  • Proves that joint alignment requires Ω(2^d_context) operations, which becomes practically intractable once the context dimension exceeds ~50
  • Demonstrates that current RLHF uses ~10^3–10^4 samples to stay tractable but would require ~10^7–10^8 samples for true global representativeness
  • Estimates current systems accept high representativeness error (ε > 0.3–0.5) to achieve partial robustness (δ ≈ 0.1–0.2)
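The quoted magnitudes can be checked with back-of-the-envelope arithmetic. The 2^d bound and the sample ranges come from the summary above; the polynomial budget d^6 is an arbitrary stand-in I chose for comparison:

```python
# Back-of-the-envelope check of the scaling-wall and sample-gap figures
# quoted above (the d^6 comparison budget is my own illustrative choice).

def joint_alignment_ops(d_context: int) -> int:
    """Lower bound Omega(2^d) on operations for joint alignment (paper's claim)."""
    return 2 ** d_context

def poly_budget(d_context: int) -> int:
    """A generous polynomial compute budget, d^6, for comparison."""
    return d_context ** 6

for d in (10, 30, 50):
    ratio = joint_alignment_ops(d) / poly_budget(d)
    print(f"d={d}: 2^d / d^6 = {ratio:.2e}")

# Sample gap between tractable RLHF and global representativeness,
# taking the most optimistic end of each quoted range:
tractable_samples = 10 ** 4        # upper end of ~10^3-10^4
representative_samples = 10 ** 7   # lower end of ~10^7-10^8
print(f"sample gap: {representative_samples // tractable_samples}x")
```

Even comparing the friendliest ends of the two ranges, the sample gap is three orders of magnitude, and the exponential term overtakes any fixed polynomial well before d = 50.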
Breakthrough Assessment
9/10
Establishes a fundamental theoretical limit for the dominant alignment paradigm (RLHF), shifting the field from engineering fixes to strategic trade-offs.