
Improving Offline RL by Blending Heuristics

Sinong Geng, Aldo Pacchiano, Andrey Kolobov, Ching-An Cheng
Princeton University, Boston University, Broad Institute, Microsoft Research
arXiv (2023)
RL Benchmark

📝 Paper Summary

Tags: Offline Reinforcement Learning · Value Bootstrapping · Data Relabeling
HUBL stabilizes offline RL by relabeling the dataset with modified rewards (blending in Monte-Carlo returns as heuristics) and reduced discount factors, softening the bootstrapping process to mitigate value-estimation errors.
Core Problem
Bootstrapping-based offline RL algorithms suffer from inconsistent performance and instability (the "deadly triad") because value-estimation errors compound when learning from fixed datasets with limited support.
Why it matters:
  • Inconsistent performance prevents deployment in high-stakes fields like healthcare and robotics where online exploration is dangerous.
  • Even state-of-the-art offline RL methods can underperform simple behavior cloning on certain datasets due to fluctuations in bootstrapping stability.
Concrete Example: A standard offline RL algorithm like CQL might perform well on one dataset but fail on another (underperforming behavior cloning) because errors in the learned Q-function propagate during bootstrapping. HUBL fixes this by partially replacing these unstable learned values with actual observed Monte-Carlo returns.
Key Novelty
Heuristic Blending (HUBL)
  • Modifies the Bellman operator to mix bootstrapped values (from the neural network) with heuristic values (Monte-Carlo returns from the dataset).
  • Implemented efficiently as a pre-processing step that relabels the offline dataset with adjusted rewards and reduced discount factors, requiring no changes to the base RL algorithm's code.
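The relabeling step above can be sketched in a few lines. The sketch below assumes a constant blending factor `lam` (the paper also considers state-dependent blending) and uses the discounted return-to-go as the heuristic; function names and the per-trajectory interface are illustrative, not the authors' code:

```python
import numpy as np

def monte_carlo_returns(rewards, gamma):
    """Discounted return-to-go h_t for one trajectory (the heuristic)."""
    h = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        h[t] = acc
    return h

def hubl_relabel(rewards, gamma, lam):
    """HUBL-style relabeling of one trajectory:
    r~_t = r_t + gamma * lam * h_{t+1}   (blend heuristic into the reward)
    gamma~ = gamma * (1 - lam)           (shrink the bootstrapped term)
    The terminal step has no next state, so its reward is unchanged.
    """
    h = monte_carlo_returns(rewards, gamma)
    new_rewards = np.array(rewards, dtype=float)
    new_rewards[:-1] += gamma * lam * h[1:]  # h evaluated at the next state
    new_gamma = gamma * (1.0 - lam)
    return new_rewards, new_gamma
```

Any base algorithm (ATAC, CQL, TD3+BC, IQL) then trains unmodified on the relabeled rewards with the reduced discount, which is why HUBL needs no changes to the learner itself.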
Evaluation Highlights
  • +9% average policy quality improvement across 27 datasets (D4RL and Meta-World) when HUBL is added to four SoTA algorithms (ATAC, CQL, TD3+BC, IQL).
  • >50% relative performance improvement on specific datasets where base offline RL methods historically show inconsistent or poor performance.
Breakthrough Assessment
8/10
Significant and consistent empirical gains (9%) across a wide range of benchmarks and base algorithms. The method is theoretically grounded and extremely easy to implement (data relabeling), making it highly practical.