MolmoAct: Action Reasoning Models that can Reason in Space

📝 Paper Summary

Robotic Manipulation, Vision-Language-Action (VLA) Models, Embodied AI

MolmoAct improves robotic control by inserting intermediate spatial reasoning steps—specifically depth perception tokens and visual trajectory traces—between observation and action, rather than mapping pixels directly to control.

Core Problem

Current Vision-Language-Action (VLA) models map perception directly to control, lacking explicit spatial reasoning and depth understanding, which limits their adaptability, generalization, and explainability.

Why it matters:

Direct mapping models (VLAs) are brittle and struggle to transfer across tasks or scenes because they lack grounded 3D understanding
Robots need to reason about constraints and goals ('think before acting') to be adaptable, similar to how LLMs benefit from Chain-of-Thought
Language-only steering is often ambiguous; users need precise ways to guide robot behavior that standard VLAs do not support

Concrete Example: When a user wants a robot to move an object along a specific path, language commands like 'move left' are ambiguous regarding magnitude and trajectory. Current VLAs cannot easily accept visual corrections. MolmoAct allows users to draw a 'visual reasoning trace' directly on the image to steer the action.

Key Novelty

Action Reasoning Model (ARM) with Spatial Chain-of-Thought

Instead of reasoning in language, the model reasons in space by autoregressively generating depth perception tokens to understand 3D geometry
It then generates a 'visual reasoning trace' (a 2D polyline on the image) representing the planned end-effector path before predicting low-level actions
Uses a novel action tokenization scheme that maps continuous action bins to byte-level BPE symbols, preserving ordinal locality for better training stability

Architecture

The three-stage inference pipeline: Perception -> Planning -> Action.

Evaluation Highlights

Achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source baselines like pi0 and GR00T N1.5
Reaches 86.6% average success on the LIBERO benchmark, including a +6.3% gain over ThinkAct on long-horizon tasks
Improves real-world task progression by +22.7% on bimanual tasks compared to pi0-FAST via fine-tuning

Breakthrough Assessment

8/10

Strong empirical gains over major baselines (pi0, GR00T) and a distinct architectural shift from direct VLA mapping to spatial reasoning/planning. The release of a mid-training dataset is also significant.

⚙️ Technical Details

Problem Definition

Setting: Robotic manipulation via Vision-Language-Action modeling

Inputs: RGB image I, language instruction T (and optional user-drawn visual trace)

Outputs: Sequence of depth tokens d, visual reasoning trace tokens tau, and action tokens a

Pipeline Flow

Vision Encoder (Processes RGB Image)
Depth Generator (Autoregressively predicts depth tokens)
Trace Generator (Autoregressively predicts 2D path waypoints)
Action Generator (Predicts low-level action tokens)
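
The staged flow above can be sketched as a chain of calls in which each stage conditions on everything produced before it. This is a minimal illustration, not the authors' implementation: in MolmoAct all three generators are the same shared decoder, whereas here they are passed in as separate hypothetical callables (`predict_depth`, `predict_trace`, `predict_action`), and `run_pipeline`, `ReasoningOutput`, and `user_trace` are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinate on the input image


@dataclass
class ReasoningOutput:
    depth_tokens: List[int]   # stage 1: perception
    trace: List[Point]        # stage 2: planning
    action_tokens: List[int]  # stage 3: action


def run_pipeline(
    image_embedding: Sequence[float],
    instruction: str,
    predict_depth: Callable[[Sequence[float], str], List[int]],
    predict_trace: Callable[[Sequence[float], str, List[int]], List[Point]],
    predict_action: Callable[
        [Sequence[float], str, List[int], List[Point]], List[int]
    ],
    user_trace: Optional[List[Point]] = None,
) -> ReasoningOutput:
    """Run depth -> trace -> action, each stage seeing all earlier outputs."""
    depth = predict_depth(image_embedding, instruction)
    if user_trace is not None:
        # Assumption: a user-drawn trace simply replaces the model's own plan.
        trace = list(user_trace)
    else:
        trace = predict_trace(image_embedding, instruction, depth)
    actions = predict_action(image_embedding, instruction, depth, trace)
    return ReasoningOutput(depth, trace, actions)
```

Passing `user_trace` mirrors the steering use case: the action stage is conditioned on the drawn polyline instead of the generated one.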

System Modules

Vision Encoder

Encodes RGB observation into visual embeddings

Model or implementation: ViT-L/14 (CLIP) or ViT-SO400M/14 (SigLIP2)

Depth Generator (Reasoning/Planning)

Generates discrete depth tokens to ground the scene in 3D

Model or implementation: Language Model Decoder (Shared Backbone)

Trace Generator (Reasoning/Planning)

Generates a 2D polyline (trace) representing the end-effector's future path

Model or implementation: Language Model Decoder (Shared Backbone)

Action Generator

Predicts precise robot actions based on the plan

Model or implementation: Language Model Decoder (Shared Backbone)

Novel Architectural Elements

Three-stage spatial reasoning pipeline (Depth -> Trace -> Action) within a single autoregressive VLM
Conditioning action generation on a generated 'visual reasoning trace' (polyline) overlaid on the image
Mapping discretized action bins to specific tokenizer BPE symbols (monotonically assigned) rather than arbitrary tokens, to preserve ordinal structure
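
The ordinal mapping in the last item can be illustrated with a short sketch. It assumes actions are normalized to [-1, 1] and uses a made-up contiguous block of token ids (`ACTION_TOKEN_IDS`) as a stand-in for the real BPE symbols, which the summary does not list. The only property shown is that neighbouring bins receive neighbouring symbols.

```python
from typing import Dict, List

NUM_BINS = 256  # action_bins from the hyperparameter list

# Hypothetical stand-in for the monotonically assigned BPE symbols.
ACTION_TOKEN_IDS: List[int] = list(range(50_000, 50_000 + NUM_BINS))
TOKEN_TO_BIN: Dict[int, int] = {tok: i for i, tok in enumerate(ACTION_TOKEN_IDS)}


def discretize(value: float, low: float = -1.0, high: float = 1.0) -> int:
    """Map a continuous value in [low, high] to a bin index in [0, NUM_BINS - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * NUM_BINS)
    return min(idx, NUM_BINS - 1)  # value == high lands in the last bin


def bin_center(idx: int, low: float = -1.0, high: float = 1.0) -> float:
    """Return the continuous value at the center of bin idx."""
    width = (high - low) / NUM_BINS
    return low + (idx + 0.5) * width


def encode_action(values: List[float]) -> List[int]:
    """Continuous action vector -> token ids, preserving order between bins."""
    return [ACTION_TOKEN_IDS[discretize(v)] for v in values]


def decode_action(token_ids: List[int]) -> List[float]:
    """Token ids -> bin-center values (error is at most half a bin width)."""
    return [bin_center(TOKEN_TO_BIN[t]) for t in token_ids]
```

Because the assignment is monotone, a prediction that is off by one symbol decodes to an action that is off by one bin, which is the locality property the summary credits for training stability.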

Modeling

Base Model: MolmoAct-7B-D (SigLIP2 + Qwen2.5-7B) and MolmoAct-7B-O (OpenCLIP + OLMo2-7B)

Training Method: End-to-end next-token prediction (Behavior Cloning)

Objective Functions:

Purpose: Predict sequence of depth, trace, and action tokens.

Formally: maximize log [ P(d|I,T) * P(tau|I,T,d) * P(a|I,T,d,tau) ], the joint log-likelihood of the depth, trace, and action tokens given the image I and instruction T
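
Taking the logarithm turns the product into a sum, so the objective splits into one term per reasoning stage. Written out with the same symbols (the explicit joint on the left-hand side is a standard chain-rule reading of the factorization, not notation quoted from the paper):

```latex
\log p(d, \tau, a \mid I, T)
  = \log p(d \mid I, T)
  + \log p(\tau \mid I, T, d)
  + \log p(a \mid I, T, d, \tau)
```

Under next-token prediction, each term is itself a sum of per-token log-probabilities over the corresponding token span.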

Adaptation: Fine-tuning on robotics data after VLM pre-training

Training Data:

Subset of Open X-Embodiment (BC-Z, BridgeData V2, RT-1)
MolmoAct Dataset (10,000+ high-quality robot trajectories)
Target depth strings generated by specialist depth estimator (Depth Anything V2 + VQVAE)
Visual traces generated from future end-effector positions in training data
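
As a rough illustration of the last item, a trace label could be built by subsampling the future end-effector positions of a trajectory into at most five waypoints. The uniform subsampling rule and the function name `make_trace` are assumptions made for this sketch; the summary does not say how the authors select waypoints or project positions into the image, so the inputs are taken to be 2D pixel coordinates already.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image pixels


def make_trace(future_positions: List[Point], max_points: int = 5) -> List[Point]:
    """Subsample future end-effector positions into at most max_points waypoints.

    The first and last positions are always kept when max_points >= 2.
    """
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    n = len(future_positions)
    if n == 0:
        raise ValueError("need at least one future position")
    if n <= max_points:
        return list(future_positions)
    if max_points == 1:
        return [future_positions[-1]]
    # Evenly spaced indices from 0 to n - 1, inclusive.
    step = (n - 1) / (max_points - 1)
    indices = [round(i * step) for i in range(max_points)]
    return [future_positions[i] for i in indices]
```

For a nine-step trajectory this keeps steps 0, 2, 4, 6, and 8, giving a five-point polyline from the current position to the final one.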

Key Hyperparameters:

action_bins: 256
depth_vocabulary_size: 128 (N)
depth_sequence_length: 100 (M)
trace_points: 1 to 5 points (L)
pretraining_gpu_hours: 9,216
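
To make the depth hyperparameters concrete, the sketch below turns M = 100 latent vectors into a string of depth tokens by nearest-neighbour lookup in a codebook of N = 128 entries, which is the quantization step of a VQVAE. The codebook contents, the latent dimensionality, and the `<DEPTH_...>` token spellings are placeholders for illustration, not values taken from the paper.

```python
from typing import List, Sequence

N_CODES = 128   # depth_vocabulary_size (N)
M_TOKENS = 100  # depth_sequence_length (M)

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map M latent vectors to M discrete depth token indices in [0, N - 1]."""
    if len(codebook) != N_CODES:
        raise ValueError(f"expected {N_CODES} codebook entries")
    if len(latents) != M_TOKENS:
        raise ValueError(f"expected {M_TOKENS} latent vectors")
    return [nearest_code(v, codebook) for v in latents]


def to_depth_string(tokens: List[int]) -> str:
    """Serialize token indices as text; the token spellings are hypothetical."""
    body = "".join(f"<DEPTH_{t}>" for t in tokens)
    return "<DEPTH_START>" + body + "<DEPTH_END>"
```

A string produced this way from the specialist depth estimator's output would serve as the training target for the depth stage.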

Compute: 9,216 GPU hours for pre-training (approx 5x less than GR00T N1.5)

Comparison to Prior Work

vs. ThinkAct: Reasons in 'space' (depth/traces) rather than language thoughts
vs. TraceVLA: Adds depth perception tokens as a precursor to trajectory generation
vs. pi0/GR00T: Fully open weights/data; uses structured spatial reasoning pipeline instead of direct mapping
vs. RT-1: Uses modern VLM backbone (Molmo/Qwen) and spatial reasoning steps

Limitations

The excerpt summarized here (abstract, introduction, method) has no explicit limitations section, so these points are inferred. Reliance on 2D image-space traces may hurt in complex, occluded 3D scenes without multi-view input.
Requires a specialist depth estimator for generating training labels (distillation process).
Visual reasoning trace is limited to a simple polyline (L<=5 points), which may not capture complex non-linear maneuvers.

Reproducibility

Code: https://allenai.org/blog/molmoact

Publicly available. The authors release model weights, training code, the MolmoAct Dataset, and the action reasoning dataset. Uses open components (OLMo2, Qwen2.5, SigLIP2, OpenCLIP).

📊 Experiments & Results

Evaluation Setup

Simulation and Real-world robotic manipulation

Benchmarks:

SimplerEnv (Visual Matching tasks in simulation)
LIBERO (Long-horizon simulation tasks)
Real-world Manipulation (Single-arm and Bimanual tasks) [New]

Metrics:

Success Rate (%)
Task Progression (%)
Human-preference scores (Elo rating)
Statistical methodology: Not explicitly reported in the paper
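
For readers unfamiliar with Elo ratings, the sketch below shows the standard update applied to pairwise human preferences between policies. The summary does not state which Elo variant, K-factor, or starting rating the authors use, so `k=32.0` and `initial=1000.0` are conventional defaults chosen for illustration, and the policy names in the usage note are placeholders.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(
    r_a: float, r_b: float, a_wins: bool, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated (r_a, r_b) after one pairwise comparison."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def rate(
    comparisons: Iterable[Tuple[str, str]],
    initial: float = 1000.0,
    k: float = 32.0,
) -> Dict[str, float]:
    """Process (winner, loser) pairs in order and return final ratings."""
    ratings: Dict[str, float] = {}
    for winner, loser in comparisons:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = update_elo(r_w, r_l, True, k)
    return ratings
```

For example, `rate([("policy_a", "policy_b")])` moves two equally rated policies to 1016.0 and 984.0, since the expected score between equals is 0.5.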

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LIBERO (Long-horizon tasks)	Success Rate	80.3	86.6	+6.3
Real-world (Single-arm)	Task Progression	Not reported in the paper	Not reported in the paper	+10
Real-world (Bimanual)	Task Progression	Not reported in the paper	Not reported in the paper	+22.7
Out-of-distribution generalization	Performance Score	Not reported in the paper	Not reported in the paper	+23.3
General performance	Average Improvement	Not reported in the paper	Not reported in the paper	+5.5

Main Takeaways

Incorporating spatial reasoning (depth + visual traces) significantly outperforms direct VLA mapping baselines (pi0, GR00T) in both sim and real-world.
The model is efficient to train: pre-training takes 9,216 GPU hours, roughly 5x less than GR00T N1.5, which is attributed to initializing action tokens from existing BPE symbols.
Visual reasoning traces enable steerability: users can edit the trajectory line to guide the robot, which is found to be more reliable than language steering.
The release of the MolmoAct Dataset (10k mid-training trajectories) provides a measurable boost (+5.5%) to general performance.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Tokenization (BPE)
Robotic Action Spaces (Joint positions/End-effector poses)
Autoregressive Generation

Key Terms

VLA: Vision-Language-Action models—systems that take images and text as input and directly output robot actions

ARM: Action Reasoning Models—the authors' proposed class of models that integrate perception, planning, and control in a structured pipeline

Visual Reasoning Trace: A 2D polyline generated by the model (or drawn by a user) on the input image, representing the planned path of the robot's end-effector

BPE: Byte-Pair Encoding, a tokenization method used in LLMs; here, discretized action bins are mapped to existing byte-level BPE symbols in monotonic order

SimplerEnv: A simulation benchmark for evaluating robotic manipulation policies

LIBERO: A benchmark for lifelong robot learning, testing generalization and long-horizon task performance

Depth Perception Tokens: Discrete tokens representing quantized depth information, distilled from a specialist depth model

CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps; MolmoAct uses 'spatial' CoT (depth -> trace -> action)

SigLIP: A vision encoder model (Sigmoid Loss for Language Image Pre-training) used as a backbone