SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

📝 Paper Summary

Vision-Language-Action (VLA) Models | Robotic Manipulation | Reinforcement Learning (RL)

SimpleVLA-RL adapts Large Reasoning Model reinforcement learning techniques to robots, using simple binary success rewards to dramatically improve manipulation performance and generalization without expensive human data.

Core Problem

Training effective robotic policies via Supervised Fine-Tuning (SFT) requires scarce, expensive human-operated data and fails to generalize to unseen objects or environments.

Why it matters:

Scaling robot learning is currently bottlenecked by the high cost of collecting human teleoperation trajectories (demonstrations).
Current VLA models struggle with distribution shifts; they perform well on exact training setups but fail when objects or scenes vary slightly.
Traditional robotic RL relies on complex, hand-crafted process rewards (e.g., 'distance to object'), which are hard to design and don't scale across diverse tasks.

Concrete Example: In the LIBERO-Long benchmark, an SFT-trained model provided with only one demonstration per task achieves a success rate of only 17.1% because it overfits to the single example. In contrast, SimpleVLA-RL explores the environment to find a robust policy, reaching 91.7% success.

Key Novelty

End-to-End Online Rule-Based RL for VLA

Adapts the GRPO (Group Relative Policy Optimization) algorithm from Large Language Models (LLMs) to Vision-Language-Action models, using only binary 'success/fail' outcome rewards.
Introduces exploration-enhancing mechanisms such as 'Dynamic Sampling' (discarding rollout groups whose trajectories all share the same outcome) and higher sampling temperatures to keep the policy from collapsing into local optima.
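The Dynamic Sampling filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `keep_informative_groups` and the task names are hypothetical, and the sketch shows only the filtering step, not how discarded groups are replaced with fresh rollouts.

```python
from typing import Dict, List


def keep_informative_groups(group_rewards: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Drop groups whose rollouts all received the same reward.

    With group-normalized advantages, an all-success or all-failure group
    gives every rollout an advantage of zero, so it contributes no gradient.
    """
    return {
        task: rewards
        for task, rewards in group_rewards.items()
        if len(set(rewards)) > 1  # keep mixed outcomes only
    }


# Hypothetical task names and outcomes, for illustration only.
rollouts = {
    "stack_blocks": [1.0, 0.0, 0.0, 1.0],  # mixed: kept
    "open_drawer": [1.0, 1.0, 1.0, 1.0],   # all success: dropped
    "pour_water": [0.0, 0.0, 0.0, 0.0],    # all failure: dropped
}
kept = keep_informative_groups(rollouts)
print(sorted(kept))
```

Only the group with mixed outcomes survives the filter, which is the case where relative comparison between rollouts carries a learning signal.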

Architecture

The SimpleVLA-RL training framework flow.

Evaluation Highlights

Achieves 91.7% success on LIBERO-Long with only a single demonstration per task, compared to 17.1% for the SFT baseline (+74.6 percentage points).
Outperforms the state-of-the-art model π_0 (pi-zero) on RoboTwin 1.0 & 2.0 benchmarks.
Exploration strategies (Dynamic Sampling, High Temp, Clip Range) yield consistent success-rate improvements of 10–15 percentage points over standard RL baselines.

Breakthrough Assessment

9/10

Successfully transfers the 'DeepSeek-R1' RL paradigm to robotics, demonstrating massive gains in data efficiency and generalization with a surprisingly simple reward structure.

⚙️ Technical Details

Problem Definition

Setting: Vision-Language-Action (VLA) policy learning via Online Reinforcement Learning

Inputs: RGB images (single-view), language instructions l_task, proprioceptive state o_prop

Outputs: Action tokens representing robot control commands (e.g., end-effector pose changes)

Pipeline Flow

Observation Encoding
VLA Policy Inference
Environment Interaction (Closed Loop)
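The three stages above form a closed loop: the policy observes, predicts a chunk of actions, the environment executes them, and the cycle repeats until the episode ends. The following is a toy sketch under stated assumptions. `ToyReachEnv`, `toy_policy`, and `rollout` are hypothetical stand-ins for a simulator and a VLA model, not part of the paper's codebase.

```python
from typing import Callable, List, Tuple


class ToyReachEnv:
    """Hypothetical 1-D stand-in for a simulator: move a point to a goal."""

    def __init__(self, goal: float = 1.0, max_steps: int = 40) -> None:
        self.goal, self.max_steps = goal, max_steps
        self.pos, self.steps = 0.0, 0

    def observe(self) -> float:
        return self.pos

    def step(self, action: float) -> Tuple[bool, bool]:
        """Apply one action; return (done, success)."""
        self.pos += action
        self.steps += 1
        success = abs(self.pos - self.goal) < 0.05
        return success or self.steps >= self.max_steps, success


def rollout(env: ToyReachEnv, policy: Callable[[float], List[float]]) -> float:
    """Closed loop: observe, predict an action chunk, execute it, repeat."""
    done, success = False, False
    while not done:
        chunk = policy(env.observe())  # one inference call per chunk
        for action in chunk:           # execute the chunk step by step
            done, success = env.step(action)
            if done:
                break
    return 1.0 if success else 0.0     # binary outcome reward


def toy_policy(obs: float, chunk_size: int = 8) -> List[float]:
    """Hypothetical policy: spread the remaining distance over the chunk."""
    return [(1.0 - obs) / chunk_size] * chunk_size


print(rollout(ToyReachEnv(), toy_policy))
```

The loop returns a single scalar per episode, which is the only supervision signal the RL stage needs.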

System Modules

Vision Encoder

Encodes single-view RGB image

Model or implementation: SigLIP/DINOv2 (inherited from OpenVLA)

VLA Transformer (VLA Policy Inference)

Autoregressively generates action tokens

Model or implementation: OpenVLA-OFT (LLaMA-2-7B backbone)

Action Decoder (VLA Policy Inference)

Detokenizes model output into continuous robot actions

Model or implementation: Discrete Token De-quantizer
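A de-quantizer of this kind can be sketched as a mapping from token indices to bin centers. The uniform bins over [-1, 1] used here are an assumption for illustration; the summary states only that the action vocabulary has 256 tokens, not the normalization range or bin layout. The function name `dequantize` is hypothetical.

```python
from typing import List


def dequantize(token_ids: List[int], low: float = -1.0, high: float = 1.0,
               vocab_size: int = 256) -> List[float]:
    """Map discrete action tokens to the centers of uniform bins in [low, high]."""
    width = (high - low) / vocab_size
    values = []
    for token in token_ids:
        if not 0 <= token < vocab_size:
            raise ValueError(f"token {token} outside action vocabulary")
        values.append(low + (token + 0.5) * width)  # center of the token's bin
    return values


# Lowest, middle, and highest tokens of an assumed 256-bin vocabulary.
print(dequantize([0, 128, 255]))
```

In a real pipeline the resulting values would still need to be rescaled to the robot's action units, a step omitted here.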

Environment Interaction

Executes actions and updates state

Model or implementation: Simulator (RoboTwin/LIBERO) or Real Robot

Novel Architectural Elements

Integrated training-inference-rendering framework extending veRL for VLA-specific interactive sampling
Parallel multi-environment rendering architecture to accelerate VLA rollout throughput

Modeling

Base Model: OpenVLA-OFT (based on LLaMA-2-7B)

Training Method: Group Relative Policy Optimization (GRPO) with customizations

Objective Functions:

Purpose: Optimize policy to maximize binary outcome rewards without a value function critic.

Formally: GRPO objective with importance sampling ratio and group-normalized advantages A_i = (R_i - mean(R_group)) / std(R_group).

Purpose: Constrain policy updates to stay within a trust region.

Formally: PPO-style clipping with range [0.8, 1.28] (asymmetric upper bound to encourage exploration).
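The two objective components above can be illustrated with a small pure-Python sketch. It is a simplified per-sample view under assumptions: the population standard deviation and the stabilizing `eps` are illustrative choices, and the function names are hypothetical rather than taken from the released code.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """A_i = (R_i - mean(R_group)) / std(R_group); eps avoids division by zero."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float,
                 clip_low: float = 0.2, clip_high: float = 0.28) -> float:
    """PPO-style clipped surrogate for one sample, with range [0.8, 1.28]."""
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


# One group of G = 8 rollouts with binary outcome rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in advantages])

# A ratio above the upper bound stops earning extra credit on a positive advantage.
print(clipped_term(ratio=1.5, advantage=1.0))
```

Because the upper bound (1.28) is looser than the lower bound (0.8), actions whose probability is rising are allowed to move further per update than actions whose probability is falling, which is the exploration-friendly asymmetry the summary describes.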

Adaptation: Full-parameter training (8x A800 GPUs)

Trainable Parameters: Full model (7B parameters)

Training Data:

Interactive rollouts in simulation environments (LIBERO, RoboTwin)
Simple outcome rewards (1 for success, 0 for failure) assigned to entire trajectory
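Assigning one outcome reward to an entire trajectory means every action token in that trajectory shares the same value. A minimal sketch, with the hypothetical helper name `trajectory_token_rewards`:

```python
from typing import List


def trajectory_token_rewards(success: bool, num_action_tokens: int) -> List[float]:
    """Broadcast the episode's binary outcome to every action token."""
    reward = 1.0 if success else 0.0
    return [reward] * num_action_tokens


print(trajectory_token_rewards(success=True, num_action_tokens=4))
print(trajectory_token_rewards(success=False, num_action_tokens=4))
```

No intermediate step is scored on its own, so credit assignment is left entirely to the comparison between successful and failed rollouts of the same task.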

Key Hyperparameters:

learning_rate: 5e-6
batch_size: 64 (training)
mini_batch_size: 128
sampling_count_G: 8
clip_ratio_low: 0.2
clip_ratio_high: 0.28
temperature: 1.6
action_chunk_size: 8 (LIBERO) or 25 (RoboTwin)
action_vocab_size: 256 tokens
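To see why a rollout temperature of 1.6 encourages exploration, consider how temperature reshapes a softmax over action-token logits. The sketch below uses made-up logits and hypothetical function names. It illustrates temperature sampling in general, not the paper's sampling code.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # made-up logits for three action tokens
p_default = softmax_with_temperature(logits, temperature=1.0)
p_explore = softmax_with_temperature(logits, temperature=1.6)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_explore])
print(sample_token(logits, temperature=1.6, rng=random.Random(0)))
```

At the higher temperature the most likely token loses probability mass to the alternatives, so the group of sampled rollouts is more diverse.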

Compute: 8 x NVIDIA A800 80GB GPUs

Comparison to Prior Work

vs. OpenVLA (Standard): SimpleVLA-RL uses RL (GRPO) instead of just SFT, uses single-view (no wrist cam), and uses a classification head for actions.
vs. DeepSeek-R1: Applied to VLA/Robotics domain instead of pure text; requires interactive environment rollout instead of autoregressive text generation.
vs. Traditional Robotic RL (e.g., PPO on states): Uses VLA (vision-language inputs) and token-based action generation; uses binary outcome rewards instead of dense process rewards.

Limitations

RL training is significantly slower and more costly than SFT due to the need for multi-round interactive environment rollouts.
Reliance on simulation for training; while Sim-to-Real is shown, training directly on real robots is likely prohibitive due to sampling costs.
Requires tasks where binary success/failure can be automatically detected (or requires a vision-language reward model).

Reproducibility

Code: https://github.com/PRIME-RL/SimpleVLA-RL

Code is publicly available at https://github.com/PRIME-RL/SimpleVLA-RL. The authors implement a modified OpenVLA-OFT (removing the wrist-camera input and switching the action output to a token-classification head trained with cross-entropy loss) and rerun SFT from scratch. The simulation environments (LIBERO, RoboTwin) are public. Training uses 8x A800 GPUs.

📊 Experiments & Results

Evaluation Setup

Simulation benchmarks (LIBERO, RoboTwin) and Real-World deployment on Agilex Piper robot.

Benchmarks:

LIBERO (lifelong-learning manipulation; suites: Goal, Spatial, Object, Long, 90)
RoboTwin 1.0 & 2.0 (Dual-arm manipulation with domain randomization)

Metrics:

Success Rate (SR)
Statistical methodology: Average success rate across 50 (LIBERO) or 100 (RoboTwin) held-out test scenarios.

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
Performance on the LIBERO benchmark, demonstrating extreme data efficiency compared to SFT:
LIBERO-Long (1 demo)	Success Rate	17.1	91.7	+74.6
RoboTwin 2.0	Success Rate	Not reported in the paper	Not reported in the paper	Not reported in the paper
Ablation studies on exploration strategies, showing the cumulative value of specific RL modifications:
LIBERO-Spatial	Success Rate	73.2	88.6	+15.4

Experiment Figures

Ablation study bar charts showing the impact of exploration enhancements on success rate.

Main Takeaways

RL with simple binary outcome rewards allows VLA models to generalize significantly better than SFT, especially when demonstration data is scarce (1 demo setting).
Exploration is critical: Modifications like Dynamic Sampling (discarding uninformative batches), higher rollout temperature (1.6), and asymmetric clipping are essential for success.
The 'Pushcut' phenomenon emerges: RL-trained policies discover novel, efficient physical behaviors (e.g., pushing an object to its target as a shortcut instead of grasping and placing it) that were never present in the supervised training data.
Sim-to-Real transfer is effective: Policies trained in simulation with domain randomization transfer to real-world tasks without requiring real-world RL training.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Vision-Language-Action (VLA) models
Supervised Fine-Tuning (SFT)

Key Terms

VLA: Vision-Language-Action—multimodal models that take visual and language inputs to generate robotic actions

SFT: Supervised Fine-Tuning—training a model on a dataset of expert demonstrations (human teleoperation)

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of sampled outputs for the same input, eliminating the need for a critic model

LLM: Large Language Model

veRL: Volcano Engine Reinforcement Learning—a library for RL training of LLMs, which this paper extends for VLAs

Process Reward: A reward signal given at intermediate steps (e.g., 'distance to object'), often manually engineered

Outcome Reward: A sparse binary reward given only at the end of a task (Success=1, Failure=0)

Pushcut: A phenomenon where the RL-trained policy discovers novel, efficient manipulation behaviors (a "push shortcut", such as pushing an object to its target instead of grasping and placing it) not present in the SFT training data

Proprioceptive State: Internal sensing of the robot's own body, such as joint angles or end-effector position

Dynamic Sampling: A strategy during RL rollout where groups whose trajectories all receive the same reward (all success or all failure) are discarded, because their group-normalized advantages are zero and would produce vanishing gradients

Action Chunking: Predicting a sequence of future actions (a chunk) in one forward pass rather than a single step, used to improve temporal consistency

Sim-to-Real: Transferring a policy trained in a physics simulation to a physical robot in the real world