
STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models

Feng Xu, Guangyao Zhai, Xin Kong, Tingzhong Fu, Daniel Gordon, Xueli An, Benjamin Busam
Technical University of Munich, Munich Research Center, Huawei Technologies, Imperial College London
arXiv.org (2025)
MM RL Agent

📝 Paper Summary

Robotic Manipulation, Vision-Language-Action (VLA) Models, Reinforcement Learning Fine-tuning
StARe-VLA improves robotic manipulation by decomposing monolithic action trajectories into semantic stages (e.g., Reach, Grasp) to provide dense, stage-specific reinforcement and preference signals.
Core Problem
Standard VLA fine-tuning methods such as TPO and PPO optimize whole trajectories. Rewards are therefore sparse and credit assignment is ambiguous: the model cannot tell which segment of a long-horizon task caused a failure.
Why it matters:
  • Robotic tasks are naturally composed of causal stages (Reach → Grasp → Place); treating them as unstructured sequences ignores this dependency
  • Sparse terminal rewards in long-horizon tasks make exploration inefficient and training unstable
  • Existing monolithic optimization cannot distinguish 'almost successful' trajectories (failed only at the last step) from 'completely failed' ones
Concrete Example: In a pick-and-place task, a robot might successfully 'Reach' and 'Grasp' but fail to 'Place'. Standard trajectory-level optimization labels the entire sequence as a failure, so the learning signal from the two successful stages is discarded.
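The contrast in this example can be sketched in a few lines of Python. This is only an illustration of the labeling difference, not the paper's implementation; the stage list and the function names `trajectory_label`, `stage_labels`, and `first_failed_stage` are hypothetical.

```python
from typing import Dict, List, Optional

# Stage names follow the pick-and-place example above.
STAGES: List[str] = ["Reach", "Grasp", "Place"]


def trajectory_label(stage_success: Dict[str, bool]) -> int:
    """Monolithic label: 1 only if every stage succeeded, else 0."""
    return int(all(stage_success[s] for s in STAGES))


def stage_labels(stage_success: Dict[str, bool]) -> Dict[str, int]:
    """Stage-level labels: one success flag per stage."""
    return {s: int(stage_success[s]) for s in STAGES}


def first_failed_stage(stage_success: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that failed, or None if all succeeded."""
    for s in STAGES:
        if not stage_success[s]:
            return s
    return None


# The episode from the example: Reach and Grasp succeed, Place fails.
episode = {"Reach": True, "Grasp": True, "Place": False}
print(trajectory_label(episode))
print(stage_labels(episode))
print(first_failed_stage(episode))
```

The trajectory-level label collapses the episode to a single failure bit, while the stage-level view keeps the two successes and pinpoints where the failure occurred.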
Key Novelty
StARe (Stage-Aware Reinforcement) + IPI Pipeline
  • Decomposes trajectories into semantic stages (Reach, Grasp, Transport, Place) using rule-based end-effector geometric constraints
  • Applies Stage-Aware TPO (StA-TPO) to align preferences at the stage level rather than the trajectory level, using stage costs as penalties
  • Applies Stage-Aware PPO (StA-PPO) using dense, potential-based rewards shaped for each specific stage's goal
  • Unifies these into an Imitation → Preference → Interaction (IPI) serial fine-tuning pipeline
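To make the first and third points concrete, the following is a minimal sketch of rule-based stage detection combined with potential-based reward shaping, F(s, s') = γΦ(s') − Φ(s). Everything specific here is an assumption for illustration: the `State` fields, the distance thresholds, the choice of negative distance as the potential, and the function names do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical thresholds in metres; not taken from the paper.
GRASP_RADIUS = 0.02
PLACE_RADIUS = 0.03


@dataclass
class State:
    ee_pos: Vec3          # end-effector position
    obj_pos: Vec3         # object position
    goal_pos: Vec3        # target placement position
    gripper_closed: bool


def detect_stage(s: State) -> str:
    """Assign a stage from simple geometric rules (illustrative only)."""
    if math.dist(s.ee_pos, s.obj_pos) >= GRASP_RADIUS:
        return "Reach"        # gripper has not yet arrived at the object
    if not s.gripper_closed:
        return "Grasp"        # at the object, gripper still open
    if math.dist(s.obj_pos, s.goal_pos) > PLACE_RADIUS:
        return "Transport"    # holding the object, still far from the goal
    return "Place"            # holding the object near the goal


def potential(s: State, stage: str) -> float:
    """Stage-specific potential: negative distance to that stage's goal."""
    if stage in ("Reach", "Grasp"):
        return -math.dist(s.ee_pos, s.obj_pos)
    return -math.dist(s.obj_pos, s.goal_pos)


def shaped_reward(s: State, s_next: State, gamma: float = 0.99) -> float:
    """Potential-based shaping term gamma * Phi(s') - Phi(s).

    Simplification: both terms use the stage detected for the current state s.
    """
    stage = detect_stage(s)
    return gamma * potential(s_next, stage) - potential(s, stage)


s0 = State((0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.4, 0.0), False)
s1 = State((0.1, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.4, 0.0), False)
print(detect_stage(s0), shaped_reward(s0, s1, gamma=1.0))
```

In this sketch, any step that moves closer to the current stage's goal earns a positive shaping reward, which gives a dense signal within each stage instead of a single terminal reward.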
Evaluation Highlights
  • Achieves a state-of-the-art success rate of 98.0% on the SimplerEnv robotic manipulation benchmark
  • Achieves a state-of-the-art success rate of 96.4% on ManiSkill3 tasks
  • Demonstrates substantial gains over monolithic trajectory-level optimization methods (like standard TPO and PPO)
Breakthrough Assessment
8/10
Addresses a fundamental limitation in VLA fine-tuning (credit assignment) with a logically sound, hierarchical approach. The reported success rates on standard benchmarks are very high (near saturation).