Evaluation Setup
Offline pre-training on static datasets followed by online fine-tuning in the environment.
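The two-phase protocol can be sketched with a toy example. Everything here is an illustrative stand-in (a tabular agent on a 2-armed bandit), not the paper's implementation; the class and function names are hypothetical.

```python
# Hypothetical sketch of the offline-to-online protocol: pre-train on a
# static dataset, then fine-tune by interacting with the environment.

class QAgent:
    """Tabular Q-learning on a 2-armed bandit, for illustration only."""
    def __init__(self, n_actions=2, lr=0.5):
        self.q = [0.0] * n_actions
        self.lr = lr

    def update(self, action, reward):
        # Move the value estimate toward the observed reward.
        self.q[action] += self.lr * (reward - self.q[action])

    def act(self):
        # Greedy action selection over current value estimates.
        return max(range(len(self.q)), key=lambda a: self.q[a])

def reward_fn(action):
    return 1.0 if action == 1 else 0.0  # arm 1 is optimal in this toy task

# Offline phase: learn from a static dataset of (action, reward) pairs.
offline_dataset = [(a, reward_fn(a)) for a in [0, 1, 0, 1, 1]]
agent = QAgent()
for action, reward in offline_dataset:
    agent.update(action, reward)

# Online phase: fine-tune by interacting with the environment.
for _ in range(10):
    action = agent.act()
    agent.update(action, reward_fn(action))
```

The point of the benchmarks below is precisely how much the online phase benefits from (or is hurt by) the offline initialization.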
Benchmarks:
- AntMaze (Navigation / Locomotion)
- Franka Kitchen (Robotic Manipulation)
- Adroit (Dexterous Manipulation)
- Visual Pick-and-Place (Robotic Manipulation, Sparse Reward)
Metrics:
- Cumulative Return
- Success Rate
- Cumulative Regret
- Statistical methodology: Not explicitly reported in the paper
Key Results
The provided text contains summary statistics but lacks the full results tables (Section 5 is omitted in the source). The entries below reflect the specific numeric claims found in the Abstract and Introduction.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| 11 fine-tuning benchmark tasks | Number of tasks with SOTA performance | Not reported | 9 | Not reported |
| Selected tasks (e.g., AntMaze, Kitchen) | Performance improvement | Qualitative reference | Qualitative reference | 30–40% |
Main Takeaways
- Conservative offline RL methods (like CQL) suffer from a 'dip' in performance at the start of fine-tuning: their learned value estimates are overly pessimistic relative to true returns, so early online updates are spent correcting the scale of the Q-function rather than improving the policy.
- Calibration is key: Forcing learned Q-values to lower-bound the behavior policy's value prevents the agent from discarding its pre-trained policy in favor of random exploration.
- Cal-QL achieves this calibration efficiently, enabling the benefits of offline initialization to translate directly into faster online fine-tuning without an initial unlearning phase.
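The calibration idea in the takeaways above can be sketched in a few lines. This is a simplified sketch under stated assumptions, not the paper's code: `calql_penalty`, its arguments, and the use of plain NumPy arrays are all illustrative. The key change relative to a CQL-style conservative penalty is clipping the pushed-down Q-values from below at an estimate of the behavior policy's value.

```python
import numpy as np

def calql_penalty(q_policy, q_data, v_behavior):
    """Conservative penalty with calibration (hypothetical sketch).

    A CQL-style penalty pushes down Q-values on policy-sampled actions
    (q_policy) and pushes up Q-values on dataset actions (q_data).
    The calibration step clips the pushed-down values from below at an
    estimate of the behavior policy's value (v_behavior), so the learned
    Q-function is not driven beneath what the data-collecting policy
    actually achieves.
    """
    calibrated = np.maximum(q_policy, v_behavior)  # the key change vs. plain CQL
    return calibrated.mean() - q_data.mean()
```

With this clipping, once policy Q-values fall to the behavior value, the penalty stops pushing them further down, which is what prevents the initial unlearning phase during online fine-tuning.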