
SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space

Swaminathan S K, Aritra Hazra
Department of Computer Science and Engineering
arXiv (2026)
RL

📝 Paper Summary

Offline-to-Online Reinforcement Learning · Safe Exploration · Latent Skill Discovery
SPAARS bridges the performance gap in offline-to-online RL by initially constraining exploration to a safe latent manifold and then selectively enabling raw action execution via a state-dependent advantage gate.
Core Problem
Offline-to-online RL faces a dilemma: raw-action exploration is unsafe and high-variance, while latent-space exploration is fundamentally capped by the decoder's reconstruction error (the 'exploitation gap').
Why it matters:
  • Direct online fine-tuning of offline policies often causes 'catastrophic forgetting' due to high-variance updates
  • Existing latent-space methods (like OPAL or SUPE) hit a hard performance ceiling because they cannot execute actions finer than the decoder's reconstruction capability
  • Robotic agents need both the safety of behavioral priors and the precision of raw motor control to achieve true optimality
Concrete Example: In a kitchen manipulation task, a latent policy might navigate to a cabinet safely but fail to open it because the precise force required is outside the decoder's reconstruction capabilities. SPAARS would switch to raw control at the cabinet handle to execute the precise opening action.
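The exploitation gap can be made concrete with a toy numeric sketch: if a decoder trained on offline data can only reproduce actions at coarse resolution, the best return reachable through latent codes sits strictly below the raw-action optimum. The reward function, decoder, and numbers below are hypothetical stand-ins chosen for illustration, not the paper's implementation.

```python
import numpy as np

def reward(a):
    # Toy reward: peaks when the action matches a precise target force.
    return -abs(a - 0.537)

def decode(z):
    # Hypothetical decoder trained on offline data: reconstruction is
    # limited to 0.1-resolution actions.
    return np.round(z, 1)

# Sweep the latent space: every code decodes to a coarse action.
latent_codes = np.linspace(0.0, 1.0, 101)
best_latent = max(reward(decode(z)) for z in latent_codes)
best_raw = reward(0.537)  # raw control can hit the target exactly

print(round(best_raw - best_latent, 3))  # → 0.037
```

No amount of latent exploration closes this 0.037 gap: it is a ceiling imposed by the decoder, which is exactly the bottleneck SPAARS's raw-action switch is designed to bypass.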
Key Novelty
Advantage-Gated Latent-to-Raw Curriculum
  • Initializes exploration strictly within a low-dimensional latent manifold derived from offline data, ensuring safety and reducing gradient variance
  • Uses a shared critic to estimate the 'exploitation advantage' of raw actions over latent actions at each state
  • Dynamically switches control to the raw policy only when it provably outperforms the decoder, bypassing the reconstruction bottleneck without discarding safe priors
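The gating logic above can be sketched in a few lines. This is a minimal, illustrative reconstruction assuming a shared critic Q(s, a) scores both the decoded latent action and the raw action; the function names, toy critic, and decoder resolution are hypothetical, not taken from the SPAARS codebase.

```python
import numpy as np

def decode(z):
    # Frozen decoder from offline pretraining: coarse, 0.1-resolution actions.
    return np.round(z, 1)

def q_value(state, action):
    # Toy shared critic: reward peaks at a precise target force that lies
    # outside the decoder's reconstruction resolution.
    return -abs(action - 0.537)

def gated_action(state, z_latent, a_raw, margin=0.0):
    """Execute the raw action only when its estimated advantage over the
    decoded latent action exceeds a margin; otherwise stay on the safe
    latent manifold."""
    a_latent = decode(z_latent)
    advantage = q_value(state, a_raw) - q_value(state, a_latent)
    return (a_raw, "raw") if advantage > margin else (a_latent, "latent")

# Near the cabinet handle, the raw policy's precise force beats the decoder:
a, mode = gated_action(state=None, z_latent=0.5, a_raw=0.537)
print(mode, a)  # → raw 0.537
```

A positive `margin` would make the switch more conservative, keeping control on the latent manifold unless the raw policy's advantage is clearly established; states where the raw proposal is worse (e.g. `a_raw=0.9` above) fall back to the decoded latent action.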
Evaluation Highlights
  • SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 vs. 0.75 for the SUPE baseline
  • SPAARS-SUPE demonstrates 5x better sample efficiency than SUPE on kitchen-mixed-v0 by warm-starting from a pretrained policy
  • Standalone SPAARS achieves 102.9 normalized return on walker2d-medium-v2, surpassing the IQL offline baseline of 78.3
Breakthrough Assessment
8/10
Identifies and formally bounds a critical theoretical limitation (exploitation gap) in prevalent latent RL methods and offers a rigorous, effective solution that improves both sample efficiency and final performance.