
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, Alexander Rakhlin
Microsoft Research, Massachusetts Institute of Technology
arXiv (2024)
RL

📝 Paper Summary

Topics: Reinforcement Learning from Human Feedback (RLHF), Online Exploration in RL
XPO augments Direct Preference Optimization with a simple exploration bonus derived from theoretical principles of global optimism, enabling provably sample-efficient learning even when the initial model lacks coverage.
Core Problem
Existing online RLHF methods rely on passive exploration (sampling from the current policy), which fails to discover novel optimal behaviors if the initial model does not already cover them.
Why it matters:
  • Passive exploration requires exponentially many samples to find optimal policies if the starting model is not already good (Proposition 2.1)
  • Current methods cannot efficiently navigate the combinatorial space of token sequences to find responses that yield maximally informative feedback
  • Achieving super-human capabilities requires models to stray from pre-training data, which passive methods discourage or fail to support effectively
Concrete Example: In a bandit setting with a poor reference policy, Online DPO requires a number of samples exponential in 1/β to find the optimal arm. XPO's exploration bonus directs the model toward uncertain regions, breaking this dependence.
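The bandit example can be made concrete with a short simulation. This is an illustrative sketch, not the paper's construction: we simply assume the reference policy's mass on the optimal arm shrinks like exp(-1/β), and measure how many passive draws it takes before that arm is ever sampled.

```python
import math
import random

random.seed(0)

def draws_until_optimal(p_opt: float) -> int:
    """Sample passively from the reference policy until the optimal arm
    (which has reference probability p_opt) is drawn; return the count."""
    n = 0
    while True:
        n += 1
        if random.random() < p_opt:
            return n

# Hypothetical reference-policy mass on the optimal arm, shrinking like
# exp(-1/beta) to mimic the exponential 1/beta dependence in the summary.
mean_draws = {}
for beta in [1.0, 0.5, 0.25, 0.125]:
    p_opt = math.exp(-1.0 / beta)
    trials = [draws_until_optimal(p_opt) for _ in range(200)]
    mean_draws[beta] = sum(trials) / len(trials)
    print(f"beta={beta:5.3f}  p_opt={p_opt:.2e}  "
          f"mean draws until optimal arm = {mean_draws[beta]:.1f}")
```

As β shrinks, the expected waiting time (roughly 1/p_opt) blows up, which is the failure mode the exploration bonus is designed to avoid.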
Key Novelty
Exploratory Preference Optimization (XPO)
  • Identifies that DPO implicitly performs Bellman error minimization for a Q* function in a KL-regularized MDP
  • Adds an exploration bonus to the DPO objective that implements 'global optimism', encouraging the model to generate responses where uncertainty is high
  • Simple one-line change to the DPO loss function that is computationally tractable yet theoretically principled
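The "one-line change" can be sketched at the level of per-example losses. This is a hedged sketch: the function names, toy log-probabilities, and the sign convention on the bonus are illustrative assumptions; only the DPO part is the standard loss, and the exact scheme for sampling the exploratory response follows the paper's Algorithm 1, which is not reproduced here.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid of the implicit reward margin
    beta * log(pi/pi_ref) between chosen (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def xpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             logp_explore, alpha=0.01, beta=0.1):
    """DPO loss plus an additive optimism bonus on the log-probability of a
    freshly sampled exploratory response (logp_explore). Minimizing the loss
    with the minus sign raises the implicit reward beta * log(pi/pi_ref) on
    that response; sign and sampling scheme here are illustrative."""
    return (dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
            - alpha * logp_explore)

# With alpha = 0 the bonus vanishes and XPO reduces exactly to DPO.
print(dpo_loss(-5.0, -7.0, -5.5, -6.5))
print(xpo_loss(-5.0, -7.0, -5.5, -6.5, logp_explore=-9.0))
```

The point of the sketch is structural: the exploration bonus is a single additive term on top of the unmodified DPO objective, so existing DPO training code needs only a one-line edit.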
Evaluation Highlights
  • XPO matches the performance of heuristic exploration baselines (Iterative DPO) while using roughly 3× fewer preference labels (Figure 1)
  • First provably sample-efficient online exploration algorithm for RLHF with general function approximation
  • Theoretical guarantee of convergence to near-optimal policy regardless of initial model coverage
Breakthrough Assessment
8/10
Strong theoretical contribution linking DPO to Bellman error minimization and providing the first sample-efficient exploration guarantees for RLHF. Empirical results are proof-of-concept but promising.