
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Heyang Zhao, Chenlu Ye, Quanquan Gu, Tong Zhang
University of California, Los Angeles, University of Illinois Urbana-Champaign
arXiv (2024)
RL Reasoning

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · Contextual Bandits
The paper proves theoretically that adding a KL-regularization penalty to the reinforcement learning objective enables much faster learning (O(1/ε) sample complexity) than standard methods, provided the reference model has good coverage.
Core Problem
Current theoretical analyses of RLHF with KL-regularization show the same slow sample complexity (O(1/ε²)) as methods without it, failing to explain why KL-regularization works so well in practice.
Why it matters:
  • Reinforcement Learning from Human Feedback (RLHF) is central to training modern LLMs like ChatGPT and Claude, yet its theoretical foundations lag behind empirical success.
  • Prior theory neglects the specific benefits of KL-regularization, suggesting it offers no statistical speedup over standard bandit algorithms.
  • Understanding how reference policy coverage affects online RLHF is crucial for designing more efficient data collection strategies.
Concrete Example: In standard bandit theory, finding an optimal policy requires O(1/ε²) samples. However, in LLM fine-tuning, we often have a strong pre-trained reference model (e.g., Llama-3-Base). Current theory suggests this reference doesn't help speed up learning, contradicting the empirical success of methods like PPO (Proximal Policy Optimization) that rely heavily on staying close to the reference.
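Concretely, the KL-regularized objective under discussion takes the standard form (the notation here is the conventional one, assumed rather than copied from the paper):

```latex
\max_{\pi}\; J(\pi) \;=\; \mathbb{E}_{x \sim d_0,\, a \sim \pi(\cdot\mid x)}\bigl[ r(x,a) \bigr]
\;-\; \frac{1}{\eta}\, \mathbb{E}_{x \sim d_0}\Bigl[ \mathrm{KL}\bigl( \pi(\cdot\mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \bigr) \Bigr]
```

The KL term keeps the learned policy close to the reference π_ref; the curvature it adds to the objective is what the sharper analysis exploits.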
Key Novelty
Two-Stage Mixed Sampling with Sharp KL Analysis
  • Introduces a new mathematical decomposition of the learning objective that exploits the strong convexity of the KL-divergence term, unlike previous analyses that treated it generically.
  • Proposes a simple two-stage algorithm: first explore using a mix of the reference policy and a learned policy, then exploit the learned policy. This leverages the 'coverage' of the reference model to reduce the need for random exploration.
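A minimal sketch of this two-stage idea in a toy single-context bandit (all names, parameters, and the uniform reference policy are illustrative assumptions, not the paper's actual algorithm):

```python
import math
import random

random.seed(0)

K = 4
true_reward = [0.1, 0.2, 0.9, 0.3]   # unknown to the learner
pi_ref = [1.0 / K] * K               # reference policy (uniform, for illustration)
eta = 5.0                            # inverse KL-regularization strength
n_explore = 2000                     # stage-1 budget
mix = 0.5                            # mixture weight on pi_ref during exploration

def kl_reg_policy(r_hat):
    # Closed-form maximizer of E[r] - (1/eta) * KL(pi || pi_ref):
    # pi(a) proportional to pi_ref(a) * exp(eta * r_hat(a)).
    w = [p * math.exp(eta * r) for p, r in zip(pi_ref, r_hat)]
    z = sum(w)
    return [x / z for x in w]

def sample(dist):
    # Draw one action index from a discrete distribution.
    u, c = random.random(), 0.0
    for a, p in enumerate(dist):
        c += p
        if u <= c:
            return a
    return len(dist) - 1

# Stage 1: explore with a mixture of pi_ref and the current learned policy.
counts = [0] * K
sums = [0.0] * K
for _ in range(n_explore):
    r_hat = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    sampler = pi_ref if random.random() < mix else kl_reg_policy(r_hat)
    a = sample(sampler)
    reward = true_reward[a] + random.gauss(0.0, 0.1)  # noisy feedback
    counts[a] += 1
    sums[a] += reward

# Stage 2: exploit the KL-regularized policy fit to the collected data.
pi_final = kl_reg_policy([s / c if c else 0.0 for s, c in zip(sums, counts)])
```

The `kl_reg_policy` helper uses the well-known Gibbs-style closed form of the KL-regularized objective; mixing in `pi_ref` during stage one is what lets the reference model's coverage substitute for explicit random exploration.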
Evaluation Highlights
  • Achieves O(1/ε) sample complexity for KL-regularized contextual bandits, a significant improvement over the standard O(1/ε²).
  • Proves a matching lower bound of Ω(1/ε), confirming the proposed analysis is tight and optimal.
  • Shows that with good reference policy coverage, sample complexity depends only additively on the coverage coefficient (D), whereas prior work required multiplicative dependence (D²).
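Constants aside, the gap between the two rates is easy to see numerically:

```python
# Ignoring constants, compare sample counts needed at several accuracy levels.
for eps in (0.1, 0.01, 0.001):
    n_sharp = 1 / eps         # O(1/eps): the paper's KL-regularized rate
    n_standard = 1 / eps**2   # O(1/eps^2): the standard bandit rate
    print(f"eps={eps}: O(1/eps) ~ {n_sharp:.0f}, O(1/eps^2) ~ {n_standard:.0f}")
```

At ε = 0.001 the standard rate needs on the order of a million samples where the sharp rate needs about a thousand.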
Breakthrough Assessment
8/10
Provides the first theoretical justification for the O(1/ε) acceleration observed in KL-regularized RLHF, bridging a major gap between theory and practice. The result fundamentally changes the understanding of why RLHF works efficiently.