
Interactive Post-Training for Vision-Language-Action Models

Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Krähenbühl
Nankai University
arXiv (2025)
MM RL Agent

📝 Paper Summary

Vision-Language-Action (VLA) Models · Embodied AI · Reinforcement Learning for VLMs
RIPT-VLA introduces a third training stage for Vision-Language-Action models: reinforcement learning from sparse binary success/failure rewards, which raises task success rates and improves few-shot adaptation.
Core Problem
Current VLA training relies heavily on offline imitation learning, which fails to correct errors during rollout (distribution shift) and requires expensive, large-scale expert demonstrations for fine-tuning.
Why it matters:
  • VLA models trained only on offline data never see the consequences of their actions, leading to compounding errors in long-horizon tasks.
  • Collecting high-quality human demonstrations for every new task is slow and expensive, limiting scalability.
  • Few-shot performance is typically poor; models degrade significantly when only a small number of demonstrations are available.
Concrete Example: A VLA model trained via imitation learning might learn to reach for an object but fail to grasp it firmly. Because offline training never shows it the outcome of that grasp, it cannot correct the behavior and keeps failing. In the paper's 1-shot setting, the model reaches only a 4% success rate after supervised fine-tuning alone.
Key Novelty
RIPT-VLA (Reinforcement Interactive Post-Training)
  • Adds a third 'post-training' stage after pre-training and supervised fine-tuning where the model interacts with the environment and receives simple success/failure feedback.
  • Uses a stable, critic-free reinforcement learning framework (LOOP) that estimates advantages by comparing multiple attempts at the same task (Leave-One-Out) rather than training a separate value network.
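The leave-one-out idea in the second bullet can be illustrated with a minimal sketch. It assumes K rollouts of the same task, each scored with a binary reward, and uses the mean reward of the other K-1 rollouts as the baseline for each one. The function name and the example rewards are hypothetical, and this covers only the advantage estimate, not the full LOOP policy update described in the paper.

```python
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each rollout relative to the mean reward of the others.

    Hypothetical helper for illustration; not taken from the RIPT-VLA code.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two rollouts per task")
    total = sum(rewards)
    # Baseline for rollout i is the average reward of the remaining k - 1
    # rollouts, so no learned value network (critic) is required.
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Four attempts at the same task: two successes, two failures (made-up data).
    print(leave_one_out_advantages([1.0, 0.0, 0.0, 1.0]))
```

Under this estimate, successful rollouts get positive advantages and failed ones negative. A group in which every rollout succeeds (or every rollout fails) yields all-zero advantages and so contributes no learning signal.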
Evaluation Highlights
  • Improves average success rate by 10.9 percentage points (absolute) over the QueST baseline across all four task suites in the LIBERO benchmark.
  • Boosts the already strong 7B OpenVLA-OFT model from 96.7% to an unprecedented 97.5% success rate.
  • Reaches a 97% success rate with only a single demonstration (1-shot), up from 4% for the SFT-only baseline, within 15 RL iterations.
Breakthrough Assessment
8/10
Demonstrates extreme data efficiency (1-shot learning) and high success rates using only sparse binary rewards, effectively addressing the data-scarcity bottleneck in embodied AI.