Evaluation Setup
Offline-to-Online Reinforcement Learning on D4RL benchmarks
Benchmarks:
- kitchen-mixed-v0 (Robotic Manipulation)
- hopper-medium-v2 (Locomotion)
- walker2d-medium-v2 (Locomotion)
Metrics:
- Normalized Return
- Sample Efficiency
- Statistical methodology: Not explicitly reported in the paper
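The normalized return used throughout D4RL maps a raw episode return onto a scale where 0 corresponds to a random policy and 100 to an expert policy. A minimal sketch of that convention (the reference scores below are illustrative placeholders, not the benchmark's official values):

```python
def normalized_return(raw_return, random_score, expert_score):
    """D4RL-style normalized return: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_score) / (expert_score - random_score)

# Illustrative values only -- not the official D4RL reference scores.
print(normalized_return(raw_return=3000.0, random_score=-20.0, expert_score=3234.3))
```

Scores above 100 (as in the walker2d-medium-v2 result below) simply mean the agent's return exceeds the expert reference score.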
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| kitchen-mixed-v0 | Normalized Return | 0.75 | 0.825 | +0.075 |
| hopper-medium-v2 | Normalized Return | 66.3 | 92.7 | +26.4 |
| walker2d-medium-v2 | Normalized Return | 78.3 | 102.9 | +24.6 |

Notes:
- SPAARS-SUPE significantly outperforms its direct predecessor SUPE and improves sample efficiency (kitchen-mixed-v0).
- Standalone SPAARS (CVAE-based) outperforms offline IQL baselines on locomotion tasks, validating the unordered-pair instantiation.
Main Takeaways
- Demonstrator alignment is a feature for safety but a bug for optimality; SPAARS effectively balances this tradeoff.
- Latent-space exploration provides provable variance reduction, on the order of O(k/d) relative to raw-space exploration (with k the latent dimension and d the raw action dimension).
- The exploitation gap is a real, theoretically bounded ceiling for latent-only methods, necessitating a bridge to raw actions.
- Concurrent behavioral cloning of the raw policy during the latent phase is critical for stable curriculum transitions.
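The O(k/d) variance-reduction claim can be illustrated numerically: injecting isotropic noise in a k-dimensional latent space and decoding it into d-dimensional action space spreads the same noise budget over more coordinates, leaving roughly k/d of the per-coordinate variance of raw-space noise. A minimal sketch, assuming a random orthonormal linear decoder and arbitrary dimensions (both are illustrative stand-ins, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 32, 4, 20_000  # raw action dim, latent dim, number of samples

# Illustrative linear "decoder" with orthonormal columns (latent -> action space).
W, _ = np.linalg.qr(rng.standard_normal((d, k)))

raw_noise = rng.standard_normal((n, d))           # exploration noise in raw space
latent_noise = rng.standard_normal((n, k)) @ W.T  # same noise scale, via latent space

# Per-coordinate variance ratio; concentrates around k/d = 0.125 here.
ratio = latent_noise.var() / raw_noise.var()
print(f"empirical ratio {ratio:.3f} vs k/d = {k / d:.3f}")
```

The orthonormal decoder keeps the total noise energy fixed, so the reduction comes purely from the dimensionality gap, which is the intuition behind the bound.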