Evaluation Setup
Offline training on fixed datasets, followed by online evaluation on unseen tasks/targets.
Benchmarks:
- Dark Room (DR): 2D GridWorld navigation, discrete actions
- Dark Key-to-Door (K2D): POMDP GridWorld, discrete actions
- MuJoCo (HalfCheetah, Ant, Hopper, Walker): continuous control
- XLand-MiniGrid: meta-RL GridWorld
Metrics:
- Normalized Area Under the Curve (NAUC)
- Return after N episodes (25, 50, 100)
- Interquartile Mean (IQM) of NAUC
- Statistical methodology: performance profiles, IQM aggregation, and 95% confidence intervals (computed with rliable)
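The two aggregates above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: NAUC is taken here as the mean of per-episode returns after min-max normalization, and IQM as the mean of the middle 50% of scores, as popularized by rliable.

```python
import numpy as np

def nauc(returns, min_return, max_return):
    """Normalized area under the learning curve: mean of per-episode
    returns after linear rescaling to [0, 1]."""
    scaled = (np.asarray(returns, dtype=float) - min_return) / (max_return - min_return)
    return float(np.mean(scaled))

def iqm(scores):
    """Interquartile mean: average of the middle 50% of scores,
    a robust aggregate that discards the top and bottom quartiles."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return float(np.mean(s[n // 4 : n - n // 4]))
```

In practice the paper's confidence intervals and performance profiles would come from rliable's stratified-bootstrap utilities rather than a hand-rolled aggregate like this.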
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ | Notes |
|---|---|---|---|---|---|
| Discrete Environments (avg. test targets) | Improvement over AD | 0.0 | 28.8 | +28.8 | Aggregated across all datasets; RL methods outperform AD |
| XLand-MiniGrid ('tiny' dataset) | NAUC | 0.22 | 0.46 | +0.24 | Large gains on this challenging dataset |
| Early Datasets (Discrete) | NAUC | 0.4 | 0.8 | +0.4 | Impact of dataset expertise (quality); RL methods excel on low-quality data |
| Continuous Environments (Overall) | Average Test NAUC | 0.6 | 0.8 | +0.2 | MuJoCo continuous-control performance |
Main Takeaways
- Explicit RL objectives consistently outperform supervised AD, particularly on unseen test targets (+28.8% for CQL)
- RL methods are far more robust to data quality, excelling on 'early' (suboptimal) datasets where AD fails completely due to behavior cloning limitations
- Offline RL approaches (CQL, IQL) generally outperform standard online RL (DQN) in this setting, highlighting the need for conservatism
- RL methods handle unstructured data (randomly ordered trajectories) much better than AD, which relies on the sequential structure of learning histories
- In continuous domains, offline RL (TD3+BC, IQL) outperforms both AD and online RL (TD3), proving the necessity of offline regularization
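The offline regularization credited above can be illustrated with the TD3+BC actor objective, which adds a behavior-cloning penalty to the usual Q-maximization term (Fujimoto & Gu, 2021). This is a hedged NumPy sketch of the loss formula only, not the paper's implementation; the inputs and the default `alpha=2.5` follow the original TD3+BC paper, not values reported here.

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """TD3+BC actor loss: maximize Q while keeping the policy's actions
    close to the dataset's actions.

    lam = alpha / mean(|Q|) rescales the Q term so the two objectives
    stay on comparable magnitudes across environments.
    """
    lam = alpha / (np.abs(q_values).mean() + 1e-8)
    q_term = -lam * q_values.mean()                             # policy improvement
    bc_term = np.mean((policy_actions - dataset_actions) ** 2)  # behavior-cloning penalty
    return q_term + bc_term
```

The BC term is what anchors the learned policy to the support of the offline data; plain online TD3 lacks it, which is consistent with the takeaway that unregularized online RL underperforms in this offline setting.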