
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang
arXiv (2023)

📝 Paper Summary

Topics: Reinforcement Learning from Human Feedback (RLHF) · LLM Alignment
The paper formulates RLHF as a reverse-KL regularized contextual bandit problem and proposes an iterative training algorithm that actively explores and generates new preference data to outperform static offline baselines.
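The reverse-KL regularized objective has a well-known closed-form solution: the optimal policy is a Gibbs distribution that reweights the reference policy by exponentiated reward. A minimal sketch over a toy discrete response set (the response probabilities, rewards, and KL strength `eta` below are all hypothetical, not from the paper):

```python
import math

def gibbs_policy(pi0, rewards, eta):
    """Closed-form optimizer of max_pi E_pi[r] - eta * KL(pi || pi0):
    pi*(y) is proportional to pi0(y) * exp(r(y) / eta)."""
    weights = [p * math.exp(r / eta) for p, r in zip(pi0, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical numbers: three candidate responses to one prompt.
pi0 = [0.5, 0.3, 0.2]       # reference (e.g. SFT) policy
rewards = [1.0, 2.0, 0.5]   # oracle reward per response

sharp = gibbs_policy(pi0, rewards, eta=0.5)   # weak KL penalty: concentrates on high reward
smooth = gibbs_policy(pi0, rewards, eta=10.0) # strong KL penalty: stays near pi0
```

Note that for any finite `eta` the optimizer remains stochastic rather than collapsing to the single argmax response, which is exactly the paper's point about stochastic optimal policies under the KL constraint.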
Core Problem
Existing RLHF methods like offline PPO and DPO rely on fixed datasets that fail to cover the exponentially large response space, leading to poor reward model generalization and overfitting.
Why it matters:
  • Static datasets in offline RLHF prevent the model from learning from its own emerging behaviors, leading to an 'alignment tax': performance degeneration relative to the base model.
  • Maximizing imperfect reward functions without strategic exploration leads to reward hacking, where models generate high-scoring but nonsensical text.
  • Current theory assumes deterministic optimal policies, but real-world generative models require stochastic policies to maintain diversity and fidelity.
Concrete Example: A 'safety reward' model might learn that refusing to answer always yields high safety scores. A deterministic maximizer (offline RL) would exploit this by refusing all prompts. In contrast, the proposed iterative approach would generate diverse responses, receive feedback that total refusal is unhelpful, and correct its policy.
Key Novelty
Iterative Direct Preference Optimization (Iterative DPO)
  • Formalizes the alignment process as a 'reverse-KL regularized contextual bandit,' providing a theoretical foundation that matches practical constraints (keeping the model close to the base).
  • Replaces static offline training with an iterative cycle: the current model generates new responses (exploration), these are labeled by an oracle/human, and the model is updated via DPO.
  • Treats the alignment process as 'online' learning, where the agent actively influences the data distribution it learns from, rather than passively ingesting a fixed batch.
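The generate-label-update cycle can be sketched end to end on a toy problem. Everything below is an illustrative simplification, not the paper's implementation: a 4-response discrete "policy" parameterized by logits, a hidden reward standing in for the preference oracle, a uniform reference policy, and hand-picked `beta`/`lr` values.

```python
import math, random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    u, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1

HIDDEN_REWARD = [0.1, 0.9, 0.4, 0.2]  # hypothetical oracle; unknown to the learner
ref_logits = [0.0] * 4                # uniform reference policy (the SFT model)
logits = list(ref_logits)
beta, lr = 0.1, 1.0

for step in range(2000):
    probs = softmax(logits)
    # Exploration: the *current* policy generates a response pair.
    yw, yl = sample(probs), sample(probs)
    if yw == yl:
        continue
    # Oracle labeling: prefer the response with higher hidden reward.
    if HIDDEN_REWARD[yl] > HIDDEN_REWARD[yw]:
        yw, yl = yl, yw
    # DPO update on the freshly labeled pair:
    # loss = -log sigmoid(beta * (log-ratio of winner - log-ratio of loser)).
    ref_probs = softmax(ref_logits)
    margin = beta * ((math.log(probs[yw]) - math.log(ref_probs[yw]))
                     - (math.log(probs[yl]) - math.log(ref_probs[yl])))
    sigma = 1.0 / (1.0 + math.exp(-margin))
    # For a softmax policy, the loss gradient w.r.t. the logits reduces to
    # -(1 - sigma) * beta * (e_winner - e_loser); step in the descent direction.
    logits[yw] += lr * (1.0 - sigma) * beta
    logits[yl] -= lr * (1.0 - sigma) * beta

final = softmax(logits)
```

The key contrast with offline DPO is that each training pair is drawn from the current policy rather than a fixed dataset, so the data distribution shifts toward the regions the model actually visits as it improves.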
Evaluation Highlights
  • Achieves a 34.79% win rate on the AlpacaEval 2 benchmark using Zephyr-SFT-7B as the base model.
  • Empirically surpasses strong offline baselines like DPO (Direct Preference Optimization) and RSO (Rejection Sampling Optimization) in real-world experiments.
  • Demonstrates that RLHF benefits significantly from online exploration compared to learning solely from fixed offline datasets.
Breakthrough Assessment
8/10
Provides a rigorous theoretical grounding (contextual bandits) for a widely used heuristic (iterative training) and demonstrates significant empirical gains (state-of-the-art level for 7B models) on a respected benchmark.