Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, Yuxin Chen
University of Pennsylvania, Princeton University, Carnegie Mellon University
Neural Information Processing Systems (2023)
RL

📝 Paper Summary

Topics: Hybrid Reinforcement Learning (Offline + Online) · Reward-agnostic Exploration · Policy Fine-tuning
A three-stage hybrid RL algorithm achieves provably better sample complexity than pure offline or online RL by leveraging offline data to guide reward-agnostic exploration of uncovered state-action pairs.
Core Problem
Pure offline RL fails when datasets lack full coverage of the optimal policy's path, while pure online RL ignores potentially useful prior data, leading to inefficient exploration.
Why it matters:
  • Offline datasets often suffer from 'partial coverage' (missing small but critical parts of the state space), which can make it impossible for pure offline RL to recover a near-optimal policy.
  • Pure online RL is sample-inefficient because it must explore everything from scratch, wasting the information contained in historical data.
  • Existing hybrid RL theory often assumes strong 'all-policy concentrability' or fails to show benefits over pure online RL in tabular settings.
Concrete Example: Consider a robot navigation task where an offline dataset covers 90% of the path to the goal but misses the final room. Pure offline RL fails because it cannot learn the final steps. Pure online RL ignores the 90% solved path and re-explores everything. The proposed method uses the offline data to skip the known 90% and focuses exploration only on the missing 10%.
Key Novelty
Three-stage Reward-Agnostic Hybrid Exploration
  • Introduces 'single-policy partial concentrability' to quantify datasets that cover most but not all of the optimal policy's path, capturing the trade-off between distribution mismatch and coverage.
  • Uses a Frank-Wolfe-based algorithm to compute two exploration policies: one that imitates the offline data distribution and another that specifically explores the uncovered parts of the state space.
  • Decouples reward learning from exploration: the algorithm collects data without knowing the reward function, querying rewards only at the final offline RL stage.
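The Frank-Wolfe step above can be illustrated on a simplified coverage objective: given the state-occupancy distributions of K candidate policies, find mixture weights maximizing the concave surrogate Σ_s log μ(s), which rewards mixtures that leave no state uncovered. This is a minimal sketch under that assumed objective; `frank_wolfe_coverage` and the log-coverage surrogate are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

def frank_wolfe_coverage(D, num_iters=200):
    """Frank-Wolfe over the probability simplex.

    D: (S, K) array whose columns are the state-occupancy
       distributions of K candidate policies.
    Returns mixture weights lam maximizing sum_s log(mu(s)),
    where mu = D @ lam is the mixed occupancy distribution.
    (Illustrative surrogate objective, not the paper's exact one.)
    """
    S, K = D.shape
    lam = np.full(K, 1.0 / K)              # start from the uniform mixture
    for t in range(num_iters):
        mu = D @ lam                        # mixture occupancy over states
        grad = D.T @ (1.0 / (mu + 1e-12))   # gradient of sum_s log mu(s) in lam
        k = int(np.argmax(grad))            # linear step: best simplex vertex
        step = 2.0 / (t + 2)                # standard Frank-Wolfe step size
        lam = (1.0 - step) * lam            # move toward the chosen vertex
        lam[k] += step
    return lam
```

Because the linear subproblem over the simplex is solved by a single vertex, each iteration simply shifts weight toward the policy whose occupancy most improves coverage of the currently least-visited states, mirroring how the second exploration policy targets the uncovered region.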
Evaluation Highlights
  • Achieves sample complexity proportional to the uncovered fraction σ of the state space, yielding significant savings over pure online RL (where σ=1).
  • Outperforms pure offline RL by achieving finite sample complexity even when the offline dataset has only partial coverage, a regime where pure offline RL can fail outright (no finite sample size suffices).
  • Algorithm is adaptive to the unknown optimal trade-off σ between distribution mismatch and coverage, automatically finding the most efficient exploration strategy.
Breakthrough Assessment
8/10
Provides the first rigorous proof in the tabular setting that hybrid RL is statistically superior to both pure online and pure offline RL, relaxing standard coverage assumptions significantly.