Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, Percy Liang
Robotics (2025)
MM RL Benchmark

📝 Paper Summary

Vision-Language-Action Models (VLAs), Robot Learning, Fine-tuning strategies
OpenVLA-OFT optimizes VLA fine-tuning by combining parallel decoding, action chunking, continuous action regression, and FiLM to achieve high-frequency control and state-of-the-art performance.
Core Problem
Current VLA fine-tuning methods rely on autoregressive generation, which is too slow (3-5 Hz) for high-frequency control and often yields unreliable performance on complex bimanual tasks.
Why it matters:
  • Autoregressive generation prevents real-time deployment on high-frequency robots (25-50+ Hz), limiting the practical utility of large VLAs.
  • Existing efficiency solutions such as faster tokenization still leave significant latency (e.g., 750 ms) between action chunks.
  • Practitioners lack a clear recipe for adapting VLAs to new robots, often defaulting to suboptimal pretraining objectives that fail on dexterous tasks.
Concrete Example: When fine-tuned with the standard autoregressive recipe, OpenVLA operates at only 3-5 Hz and cannot reliably execute bimanual tasks such as folding clothes. In contrast, the proposed OFT recipe generates a whole chunk of actions in parallel, which makes it fast enough for high-frequency control and lets it manipulate objects successfully.
Key Novelty
Optimized Fine-Tuning (OFT) Recipe for VLAs
  • Replaces token-by-token autoregressive generation with parallel decoding, allowing the model to predict an entire chunk of future actions in a single forward pass.
  • Switches from discrete token classification to continuous L1 regression, improving precision and eliminating quantization artifacts without complex diffusion steps.
  • Integrates FiLM (Feature-wise Linear Modulation) to inject language goals directly into visual features, fixing 'spurious correlation' issues where the robot ignores instructions.
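The two simplest pieces of the recipe, FiLM conditioning and the L1 objective over an action chunk, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names `film` and `l1_chunk_loss` are hypothetical, the modulation uses the generic FiLM form (scale, then shift), and the exact parameterization in the paper may differ. In a real model, `gamma` and `beta` would be produced from the language embedding by learned projections; here they are hard-coded.

```python
from typing import List

Vector = List[float]


def film(features: Vector, gamma: Vector, beta: Vector) -> Vector:
    """Generic FiLM: scale and shift each feature channel.

    gamma and beta are assumed to come from the language embedding
    (hypothetical inputs here; no learned projection is modeled).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have equal length")
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


def l1_chunk_loss(pred: List[Vector], target: List[Vector]) -> float:
    """Mean absolute error over every dimension of every action in a chunk.

    pred and target are chunks of continuous actions with shape
    (chunk_length, action_dim), so no discretization is involved.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty chunks of equal length")
    diffs = []
    for pred_action, target_action in zip(pred, target):
        if len(pred_action) != len(target_action):
            raise ValueError("action dimensions must match")
        diffs.extend(abs(p - t) for p, t in zip(pred_action, target_action))
    return sum(diffs) / len(diffs)


# Toy values, chosen only for illustration.
modulated = film([1.0, 2.0, 3.0], gamma=[0.5, 1.0, 2.0], beta=[0.0, -1.0, 1.0])
print(modulated)  # [0.5, 1.0, 7.0]

loss = l1_chunk_loss(
    pred=[[0.0, 1.0], [2.0, 3.0]],    # two predicted actions, two dims each
    target=[[0.5, 1.0], [2.0, 2.0]],  # matching ground-truth chunk
)
print(loss)  # 0.375
```

Because the loss is computed over the whole chunk at once, all actions in the chunk can be predicted in a single forward pass rather than token by token.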
Evaluation Highlights
  • Achieves a 97.1% success rate on the LIBERO benchmark, surpassing standard fine-tuned OpenVLA (76.5%) and π0 (94.2%).
  • Increases action generation throughput by 26× with 8-step chunks and up to 43× with 25-step chunks compared to base OpenVLA.
  • Outperforms diffusion-based policies (π0, RDT-1B) and scratch-trained policies (ACT, Diffusion Policy) by up to 15 percentage points in absolute success rate on real-world ALOHA tasks.
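The throughput figures follow from simple arithmetic: throughput is the number of actions produced per forward pass divided by the latency of that pass, so longer chunks raise throughput as long as latency grows more slowly than chunk length. The sketch below shows the calculation. The helper name `actions_per_second` and the latency values are hypothetical and are not measurements from the paper.

```python
def actions_per_second(chunk_size: int, latency_s: float) -> float:
    """Throughput in actions per second for one forward pass of latency_s."""
    if chunk_size < 1 or latency_s <= 0:
        raise ValueError("chunk_size must be >= 1 and latency_s must be > 0")
    return chunk_size / latency_s


# Hypothetical latencies, for illustration only.
baseline = actions_per_second(chunk_size=1, latency_s=0.25)   # one action per pass
chunked = actions_per_second(chunk_size=8, latency_s=0.125)   # eight actions per pass

print(baseline)            # 4.0
print(chunked)             # 64.0
print(chunked / baseline)  # 16.0
```

With these made-up numbers the chunked policy is 16× faster; the 26× and 43× figures reported above come from the paper's own latency measurements.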
Breakthrough Assessment
9/10
Establishes a new SOTA on standard benchmarks while solving the critical inference latency bottleneck of autoregressive VLAs, making large 7B models practical for real-time high-frequency control.