OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

📝 Paper Summary

Vision-Language-Action (VLA) Models, Robotic Manipulation, Embodied Reasoning

OneTwoVLA integrates high-level reasoning and low-level control into a single model that autonomously decides when to reason and when to act, enabling efficient execution and error recovery.

Core Problem

Dual-system approaches separate high-level planners (System 2) from low-level actors (System 1), leading to capability mismatches where planners command infeasible actions and latency issues that hinder real-time responsiveness.

Why it matters:

Lack of mutual awareness between separated systems causes execution failures when the planner generates instructions the actor cannot physically perform.
Significant inference latency in large reasoning models prevents robots from reacting quickly to dynamic changes or errors during execution.

Concrete Example: In a 'Tomato-Egg' cooking task, a dual-system planner might instruct the robot to 'add green onion' because it is in the recipe, failing to realize no green onion is visible; the separated actor then stalls or behaves erratically.

Key Novelty

Adaptive Reasoning via Unified VLA

The model uses special tokens ([BOR] for reasoning, [BOA] for acting) to autonomously switch modes, reasoning only at critical moments (e.g., error detection) while acting efficiently otherwise.
Uses a scalable pipeline (built on FLUX.1 and Gemini) to synthesize 'reasoning-centric' vision-language data; the model is co-trained on this data together with robot data to boost generalization.

Architecture

Inference flow of OneTwoVLA. The model takes visual and text inputs, first predicting a decision token ([BOR] or [BOA]).

Evaluation Highlights

+30 percentage points in average success rate (57% to 87%) over the flat VLA baseline (pi_0) across three long-horizon manipulation tasks (Tomato-Egg, Hotpot, Cocktail).
+24 percentage points in average success rate (63% to 87%) over a dual-system baseline (Gemini 2.5 Pro + pi_0) on the same long-horizon tasks.
Demonstrates zero-shot generalization to novel instructions (e.g., 'Help me stay awake' -> make coffee) by co-training with 16,000 synthetic vision-language samples.

Breakthrough Assessment

8/10

Successfully unifies System 1 and System 2 in robotics with a practical adaptive mechanism, showing significant gains over strong baselines and demonstrating how synthetic data can bridge the reasoning gap.

⚙️ Technical Details

Problem Definition

Setting: Robotic control policy capable of dual-mode operation (reasoning and acting)

Inputs: Current image observations I_t, reference images I_ref, language instruction l, and latest reasoning content R

Outputs: Updated reasoning content R_hat (text) OR Action chunk A_t (joint positions/gripper state)

Pipeline Flow

Visual Encoder (Processes history/current images)
Mode Selector (Predicts [BOR] or [BOA])
Branch A: Reasoning Generation (If [BOR], output text)
Branch B: Action Generation (If [BOA], output action chunk)
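
The branching described above can be sketched as a single control step. This is a minimal illustration, not the authors' implementation: `PolicyState`, `step`, and the three callables (`select_mode`, `generate_reasoning`, `generate_actions`) are hypothetical names standing in for the model's components, and returning an empty chunk on a reasoning step is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

BOR = "[BOR]"  # decision token: beginning of reasoning
BOA = "[BOA]"  # decision token: beginning of action


@dataclass
class PolicyState:
    """Hypothetical container for what persists across control steps."""
    instruction: str
    reasoning: str = ""  # latest reasoning content R


def step(
    state: PolicyState,
    observation: object,
    select_mode: Callable[[object, PolicyState], str],
    generate_reasoning: Callable[[object, PolicyState], str],
    generate_actions: Callable[[object, PolicyState], List[Sequence[float]]],
) -> List[Sequence[float]]:
    """Run one step; return an action chunk (empty if the model reasoned)."""
    mode = select_mode(observation, state)
    if mode == BOR:
        # Branch A: refresh the stored reasoning text, emit no actions (assumed).
        state.reasoning = generate_reasoning(observation, state)
        return []
    if mode == BOA:
        # Branch B: act, conditioned on the observation and latest reasoning.
        return generate_actions(observation, state)
    raise ValueError(f"unexpected decision token: {mode!r}")


def demo():
    """Toy run with stub components: reason once, then act."""
    state = PolicyState(instruction="make tomato-egg scramble")
    select = lambda obs, s: BOR if not s.reasoning else BOA
    reason = lambda obs, s: "Plan: pour oil; Next step: pour oil"
    act = lambda obs, s: [[0.0] * 7, [0.1] * 7]
    first = step(state, None, select, reason, act)
    second = step(state, None, select, reason, act)
    return first, second, state.reasoning
```

In the toy run, the first call updates the reasoning and the second call produces a two-step action chunk, which mirrors the idea of reasoning only at selected moments.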

System Modules

Visual Encoder

Encodes multi-view camera inputs and reference images into embeddings

Model or implementation: Vision Transformer (from pi_0 base)

Mode Selector

Decides whether to update reasoning or execute actions

Model or implementation: Auto-regressive Transformer Head

Reasoning Generator

Generates textual reasoning (Plan, Scene Description, History, Next Step)

Model or implementation: VLM Decoder (auto-regressive)

Action Expert

Generates continuous robot actions based on reasoning and observation

Model or implementation: Flow Matching Policy (from pi_0)

Novel Architectural Elements

Adaptive switching mechanism via learned decision tokens ([BOR]/[BOA]) integrated into the VLA context window
Integration of structured reasoning history (Scene, Plan, History, Next Step) directly into the action conditioning context
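
To make the second element concrete, the sketch below shows one way the four reasoning fields could be serialized into a text context that ends with a decision token. The field names come from the summary; the `Reasoning` class, the `build_context` helper, and the exact text layout are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reasoning:
    """Hypothetical holder for the four reasoning fields named in the summary."""
    scene: str
    plan: List[str]
    history: List[str] = field(default_factory=list)
    next_step: str = ""

    def render(self) -> str:
        # The line-per-field layout is an assumption, not the paper's format.
        lines = [
            f"Scene: {self.scene}",
            "Plan: " + "; ".join(self.plan),
            "History: " + ("; ".join(self.history) if self.history else "none"),
            f"Next step: {self.next_step}",
        ]
        return "\n".join(lines)


def build_context(instruction: str, reasoning: Reasoning, decision_token: str) -> str:
    """Concatenate instruction, reasoning text, and the decision token."""
    return f"Instruction: {instruction}\n{reasoning.render()}\n{decision_token}"


example = build_context(
    "make a cocktail",
    Reasoning(
        scene="shaker and two bottles on the table",
        plan=["pour bottle A", "pour bottle B"],
        next_step="pour bottle A",
    ),
    "[BOA]",
)
```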

Modeling

Base Model: pi_0 (Black et al., 2024)

Training Method: Supervised Fine-Tuning (SFT) + Flow Matching

Objective Functions:

Purpose: Train the reasoning and decision-making capabilities.

Formally: Cross-entropy loss for text tokens and decision tokens ([BOR]/[BOA])

Purpose: Train the continuous action generation.

Formally: Flow matching loss for the action expert component
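
The two objectives can be illustrated with a small pure-Python sketch. It assumes a common flow matching convention (a straight path from Gaussian noise at tau = 0 to the clean action at tau = 1, with target velocity `action - noise`), which may differ from the exact formulation used in pi_0 or OneTwoVLA. The function names are hypothetical, and the summary does not state how the two losses are weighted, so they are shown separately.

```python
import math
import random
from typing import Callable, List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def flow_matching_loss(
    velocity_fn: Callable[[List[float], float], List[float]],
    action: Sequence[float],
    rng: random.Random,
) -> float:
    """Single-sample flow matching loss on a straight noise-to-action path."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    tau = rng.random()  # flow time in [0, 1)
    # Point on the linear path: tau = 0 is pure noise, tau = 1 is the action.
    x_tau = [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]
    # Velocity of that path is constant in tau.
    target = [a - n for n, a in zip(noise, action)]
    pred = velocity_fn(x_tau, tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(action)


# Toy check: an oracle that knows the clean action recovers the target velocity,
# since action - x_tau = (1 - tau) * (action - noise).
clean_action = [0.2, -0.4, 0.9]
oracle = lambda x, tau: [(a - xi) / (1.0 - tau) for a, xi in zip(clean_action, x)]
oracle_loss = flow_matching_loss(oracle, clean_action, random.Random(0))
```

In a real model, `velocity_fn` would be the action expert conditioned on observations and reasoning, and `logits` would come from the language head for text and decision tokens.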

Training Data:

Robot Data: Demonstration trajectories segmented into 'reasoning intervals' (annotated with plans/descriptions) and 'acting intervals'
Synthetic Data: 16,000 samples generated via Gemini 2.5 Pro (text) -> FLUX.1-dev (image) -> Gemini 2.5 Pro (reasoning)
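
As a rough illustration of the robot-data segmentation, the sketch below assigns a target decision token to every timestep of a trajectory given annotated reasoning intervals. The `label_timesteps` helper and the half-open interval representation are assumptions; the summary does not describe the actual annotation format.

```python
from typing import List, Tuple

BOR = "[BOR]"
BOA = "[BOA]"


def label_timesteps(
    num_steps: int, reasoning_intervals: List[Tuple[int, int]]
) -> List[str]:
    """Assign a target decision token to each timestep.

    `reasoning_intervals` holds half-open (start, end) index pairs; every
    timestep outside them is treated as part of an acting interval.
    """
    labels = [BOA] * num_steps
    for start, end in reasoning_intervals:
        if not 0 <= start < end <= num_steps:
            raise ValueError(f"interval {(start, end)} is outside the trajectory")
        for t in range(start, end):
            labels[t] = BOR
    return labels


# Hypothetical 6-step trajectory that reasons at the start and once mid-task.
labels = label_timesteps(6, [(0, 1), (3, 4)])
```

Timesteps labeled `[BOR]` would be paired with text targets (plan, scene description, and so on), while `[BOA]` timesteps would be paired with action chunks.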

Compute: Not reported in the paper

Comparison to Prior Work

vs. Hi Robot: OneTwoVLA is a single unified model rather than two separate models, enabling gradient-based alignment and reduced latency.
vs. pi_0.5: OneTwoVLA adaptively reasons only when necessary (sparse reasoning) rather than at every step, improving efficiency.
vs. ECoT-Lite: OneTwoVLA explicitly generates reasoning strings at inference time for better generalization and interpretability, whereas ECoT-Lite suppresses them.

Limitations

Relies on proprietary foundation models (Gemini, FLUX) for the data synthesis pipeline.
Reasoning is still text-based; visual reasoning capabilities are implicit in the VLM backbone.
Requires carefully curated/annotated robot data with reasoning intervals for the 'robot' portion of training.

Reproducibility

Code: https://one-two-vla.github.io/

The project page is available at https://one-two-vla.github.io/. The synthetic data generation pipeline uses Gemini 2.5 Pro and FLUX.1-dev (closed source/API dependencies), and the base model is pi_0. Only the project page is identified as the code release; a specific repository URL is not given.

📊 Experiments & Results

Evaluation Setup

Real-world robotic manipulation with a 7-DoF Franka arm and dual 6-DoF ARX arms.

Benchmarks:

Tomato-Egg: long-horizon cooking (pouring, scooping) [New]
Hotpot: long-horizon sorting and precise placement [New]
Cocktail: long-horizon pouring with visual discrimination [New]

Metrics:

Success Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

OneTwoVLA significantly outperforms both the flat VLA baseline and the dual-system baseline on long-horizon tasks.

Benchmark	Metric	Baseline	Baseline Score	This Paper	Δ
Average (3 tasks)	Success Rate (%)	pi_0 (flat VLA)	57	87	+30
Average (3 tasks)	Success Rate (%)	Dual-system (Gemini 2.5 Pro + pi_0)	63	87	+24
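
The reported gains are absolute differences in average success rate, which a short check makes explicit. The scores are taken from the table above; the dictionary keys and the `deltas` helper are illustrative names only.

```python
# Average success rates (%) over the three long-horizon tasks, from the table.
baselines = {
    "pi_0 (flat VLA)": 57,
    "Dual-system (Gemini 2.5 Pro + pi_0)": 63,
}
onetwovla = 87


def deltas(baseline_scores, ours):
    """Absolute improvement in percentage points over each baseline."""
    return {name: ours - score for name, score in baseline_scores.items()}


gains = deltas(baselines, onetwovla)
# gains == {"pi_0 (flat VLA)": 30, "Dual-system (Gemini 2.5 Pro + pi_0)": 24}
```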

Experiment Figures

Left: Success rates on 3 long-horizon tasks comparing OneTwoVLA with pi_0 and Dual-System. Right: Qualitative examples of generalizable planning.

Main Takeaways

Unified modeling prevents the 'capability mismatch' seen in dual-systems, where high-level planners command actions the low-level policy cannot execute.
Adaptive reasoning allows the model to maintain the speed of a flat policy (System 1) most of the time, only incurring latency costs when reasoning (System 2) is actually needed.
Co-training with synthetic vision-language data significantly enhances generalization, allowing the robot to understand abstract or novel instructions not present in robot training data.
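
The latency argument in the second takeaway can be written as a simple expectation. This is a back-of-the-envelope sketch, assuming each control step either reasons or acts; the function name and all numbers are hypothetical and are not measurements from the paper.

```python
def expected_step_latency(p_reason: float, t_reason: float, t_act: float) -> float:
    """Expected per-step latency when a step reasons with probability p_reason."""
    if not 0.0 <= p_reason <= 1.0:
        raise ValueError("p_reason must be in [0, 1]")
    return p_reason * t_reason + (1.0 - p_reason) * t_act


# Hypothetical timings in seconds, chosen only to illustrate the trend.
always_reason = expected_step_latency(1.0, t_reason=2.0, t_act=0.1)
sparse_reason = expected_step_latency(0.05, t_reason=2.0, t_act=0.1)
```

As `p_reason` approaches zero, the expected latency approaches that of a flat policy, which is the sense in which sparse reasoning preserves System 1 speed.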

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language-Action (VLA) models
Familiarity with Kahneman's System 1 (fast) vs. System 2 (slow) framework
Basics of diffusion policies or flow matching for action generation

Key Terms

VLA: Vision-Language-Action model—a foundation model trained to output robot actions directly from vision and language inputs

System 1: In cognitive science/AI, the component responsible for fast, intuitive, and automatic execution (acting) without explicit deliberation

System 2: The component responsible for slow, deliberate, and logical processing (reasoning/planning)

Flow Matching: A generative modeling technique that learns a velocity field transporting noise samples to data samples; here it is used to train the action expert (the model's 'action head') that outputs continuous actions

[BOR]: Beginning of Reasoning—a special decision token indicating the model should generate text reasoning

[BOA]: Beginning of Action—a special decision token indicating the model should generate physical robot actions

Co-training: Training a model simultaneously on multiple datasets (here, robot demonstration data and synthetic vision-language data) to transfer capabilities

DoF: Degrees of Freedom—the number of independent parameters that define the configuration of a robotic arm (e.g., 7-DoF arm)