Evaluation Setup
Fine-tuning offline RL agents on continuous control tasks
Benchmarks:
- D4RL offline RL benchmarks (Kitchen, AntMaze, MuJoCo)
Metrics:
- Success Rate
- Discounted Return
- TD-Error (see the sketch after this list)
- Statistical methodology: Not explicitly reported in the paper
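
The return and TD-error metrics are standard quantities; below is a minimal sketch of how they might be computed for a single episode or transition, assuming a generic scalar critic `q_fn` (the paper's own evaluation code is not reproduced here).

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t over one episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def td_error(q_fn, s, a, r, s_next, a_next, done, gamma=0.99):
    """One-step TD error: r + gamma * Q(s', a') - Q(s, a).

    `q_fn(state, action)` is assumed to return a scalar Q-value estimate;
    `done` is 1.0 at terminal transitions so bootstrapping is cut off.
    """
    target = r + gamma * (1.0 - done) * q_fn(s_next, a_next)
    return target - q_fn(s, a)
```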
Key Results
Experiments demonstrate that standard offline RL methods fail catastrophically when fine-tuning without retaining offline data.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| D4RL kitchen-partial | Success Rate | Not reported in the paper | 0 | Not reported in the paper |
Main Takeaways
- Retaining offline data prevents value divergence but slows down asymptotic learning compared to pure online RL
- Without offline data, Q-values under the offline distribution diverge significantly, leading to forgetting
- A short warmup phase with a frozen policy is sufficient to 'recalibrate' the Q-function, enabling successful fine-tuning without old data
- WSRL combined with high update-to-data (UTD) ratio online RL achieves faster learning and better final performance than methods constrained by offline data retention (see the sketch after this list)
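
As a rough illustration of the warmup-then-fine-tune recipe in the takeaways above, the sketch below seeds an online-only replay buffer with rollouts from the frozen pretrained policy, then switches to high-UTD gradient updates. The `agent` interface (`act`, `update`), the Gymnasium-style `env`, and all hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO buffer holding only online transitions (no offline data retained)."""

    def __init__(self, capacity=1_000_000):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        # Sample with replacement, as is common for off-policy replay.
        return random.choices(self.data, k=batch_size)


def warmstart_finetune(env, agent, warmup_steps=5_000, online_steps=250_000,
                       utd_ratio=4, batch_size=256):
    """Warm-start online fine-tuning without retaining offline data.

    Assumed agent interface: `agent.act(obs)` samples an action from the
    offline-pretrained policy, and `agent.update(batch)` performs one
    actor-critic gradient step.
    """
    buffer = ReplayBuffer()
    obs, _ = env.reset()

    for step in range(warmup_steps + online_steps):
        # Collect one transition. During warmup no gradient updates are made,
        # so the policy stays frozen and the buffer is seeded with data from
        # the pretrained agent; the Q-function then recalibrates on this data
        # once updates begin, rather than on arbitrary early online behavior.
        action = agent.act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        buffer.add((obs, action, reward, next_obs, float(terminated)))
        obs = next_obs if not (terminated or truncated) else env.reset()[0]

        if step < warmup_steps:
            continue  # warmup phase: data collection only

        # High update-to-data (UTD) ratio: several gradient steps per
        # environment step, using only the online buffer.
        for _ in range(utd_ratio):
            agent.update(buffer.sample(batch_size))
```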