
Efficient Online Reinforcement Learning with Offline Data

Philip J. Ball, Laura M. Smith, Ilya Kostrikov, Sergey Levine
University of Oxford, University of California, Berkeley
International Conference on Machine Learning (2023)
RL Benchmark

📝 Paper Summary

Offline-to-Online Reinforcement Learning · Sample-Efficient RL · Off-Policy RL
RLPD enables standard off-policy reinforcement learning algorithms to efficiently leverage offline data by combining symmetric sampling of online and offline data, layer-normalized critics that prevent value divergence, and large critic ensembles trained at high update-to-data ratios.
Core Problem
Naive application of off-policy RL to offline data fails due to catastrophic Q-value overestimation on out-of-distribution actions, while specialized offline-to-online methods are overly complex and conservative.
Why it matters:
  • Pure online RL is sample-inefficient and dangerous in real-world settings (e.g., robotics), while pure offline RL cannot improve beyond the static dataset
  • Existing hybrid approaches require complex pre-training phases or explicit policy constraints that limit the agent's ability to explore and improve asymptotically
  • Standard off-policy algorithms (like SAC) theoretically should utilize offline data but fail in practice due to distribution shift instabilities
Concrete Example: In the D4RL 'AntMaze' task, naively running Soft Actor-Critic (SAC) with offline data results in near-zero returns because the critic's value estimates diverge to infinity for unseen actions. RLPD fixes this, solving the maze where naive SAC fails completely.
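The runaway value estimates described above stem from unbounded critic outputs; the layer-normalized critic RLPD uses caps them. A minimal numpy sketch of this bounding effect (illustrative only, not the paper's code): with a linear head `w` on layer-normalized features, |Q| ≤ ‖w‖·√d by Cauchy-Schwarz, no matter how large the raw features grow.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize features to zero mean / unit variance (no learned scale, for clarity).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def q_head(features, w):
    # Linear critic head on layer-normalized features.
    # ||LN(h)|| <= sqrt(d) by construction, so |Q| <= ||w|| * sqrt(d).
    return float(w @ layer_norm(features))

rng = np.random.default_rng(0)
d = 64
w = rng.normal(size=d)
bound = np.linalg.norm(w) * np.sqrt(d)

# Even as hidden activations blow up, the Q estimate stays within the bound.
for scale in [1.0, 1e3, 1e9]:
    h = scale * rng.normal(size=d)
    assert abs(q_head(h, w)) <= bound
```

Without the normalization, the same linear head scales linearly with the feature magnitude, which is exactly the divergence mode seen with naive SAC on AntMaze.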
Key Novelty
RLPD (Reinforcement Learning with Prior Data)
  • Integration of Layer Normalization into the critic network, which mathematically bounds Q-value estimates by the network weights, preventing runaway overestimation without explicit constraints
  • A 'symmetric sampling' strategy that constructs every training batch with 50% online data and 50% offline data, ensuring stable gradients while allowing exploration
  • Use of high Update-To-Data (UTD) ratios combined with large critic ensembles (as in REDQ, Randomized Ensembled Double Q-learning) to rapidly absorb the information in the offline data
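The symmetric sampling strategy in the second bullet is simple enough to sketch directly. The buffers below are plain Python lists standing in for replay buffers (a simplifying assumption, not the authors' implementation):

```python
import numpy as np

def symmetric_batch(online_buffer, offline_buffer, batch_size, rng):
    """Build a training batch that is 50% online and 50% offline data (RLPD-style)."""
    half = batch_size // 2
    # Sample uniformly with replacement from each buffer.
    online_idx = rng.integers(0, len(online_buffer), size=half)
    offline_idx = rng.integers(0, len(offline_buffer), size=batch_size - half)
    return ([online_buffer[i] for i in online_idx]
            + [offline_buffer[i] for i in offline_idx])

rng = np.random.default_rng(0)
online = [("online", i) for i in range(10)]        # small, freshly collected
offline = [("offline", i) for i in range(1000)]    # large prior dataset
batch = symmetric_batch(online, offline, 256, rng)
```

Because every gradient step sees both distributions in fixed proportion, the critic is grounded in offline data from the first update while still tracking the policy's own experience, with no separate pre-training phase.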
Evaluation Highlights
  • Achieves ~2.5x improvement over prior state-of-the-art on the Adroit 'Door' task compared to IQL + Finetuning
  • Effectively 'solves' all 6 D4RL AntMaze tasks in less than one-third of the environment steps required by prior methods
  • Demonstrates 6x higher returns than DrQ-v2 on the V-D4RL 'Humanoid Walk' pixel-based task by effectively leveraging expert offline data
Breakthrough Assessment
9/10
Significantly outperforms complex prior methods using a surprisingly simple set of architectural modifications to standard algorithms. Sets a new standard for simplicity and performance in offline-to-online RL.