Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning

📝 Paper Summary

Vision-Language-Action (VLA) models, Robotic manipulation, Dual-system AI

FiS-VLA integrates a fast diffusion-based execution module directly into the final layers of a slow reasoning VLM, enabling high-frequency control without sacrificing semantic understanding.

Core Problem

Current Vision-Language-Action (VLA) models are either too slow for real-time control (due to massive parameters) or rely on separate, disjointed policy heads that fail to fully leverage the VLM's pretrained knowledge.

Why it matters:

Low operating frequencies in large VLMs (e.g., <5 Hz) cause latency that makes responsive closed-loop robotic control impossible in dynamic environments.
Existing dual-system approaches treat the fast policy (System 1) as a separate, lightweight appendage, preventing it from accessing the rich internal representations of the reasoning model (System 2).

Concrete Example: A standard VLA might correctly reason 'pick up the red cup' but fail to adjust its grip in real time if the cup slips, because the reasoning loop takes too long (e.g., 200 ms) to generate the next action. Meanwhile, a separate fast policy might react quickly but lose the semantic instruction 'red cup' if its connection to the VLM is weak.

Key Novelty

Unified Fast-in-Slow Architecture (FiS)

Repurposes the final transformer blocks of a large VLM (System 2) into a fast execution module (System 1) rather than attaching an external network.
System 1 generates high-frequency actions via diffusion, conditioned on the slow System 2's latent reasoning features and real-time 3D/proprioceptive inputs.
Uses an asynchronous design where System 2 updates high-level reasoning at a lower frequency (a 1:4 System 2 : System 1 ratio in the reported configuration) while System 1 executes actions at every control step using the most recent reasoning context.
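
The asynchronous schedule described above can be sketched as a toy simulation. It assumes the 1:4 System 2 : System 1 ratio listed under Key Hyperparameters and only tracks which reasoning context each fast step would reuse. `Tick` and `run_async_schedule` are hypothetical names for illustration, not part of the FiS-VLA code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tick:
    step: int          # index of the System 1 (fast) step
    context_step: int  # step at which the System 2 context in use was computed
    refreshed: bool    # True if System 2 ran on this step


def run_async_schedule(num_steps: int, ratio: int = 4) -> List[Tick]:
    """Simulate which System 2 context each System 1 step consumes.

    `ratio` is the assumed number of System 1 steps per System 2 update.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    ticks: List[Tick] = []
    context_step = -1
    for step in range(num_steps):
        refreshed = step % ratio == 0
        if refreshed:
            # Slow path: System 2 recomputes its latent reasoning features.
            context_step = step
        # Fast path: System 1 acts using the most recent context.
        ticks.append(Tick(step, context_step, refreshed))
    return ticks


schedule = run_async_schedule(num_steps=8, ratio=4)
for tick in schedule:
    print(tick.step, tick.context_step, tick.refreshed)
```

In this sketch, steps 0 to 3 reuse the context computed at step 0 and steps 4 to 7 reuse the one from step 4, so the slow model runs only twice over eight fast steps.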

Architecture

The overall architecture of FiS-VLA, showing the shared Vision Encoder, the System 2 VLM backbone, and the embedded System 1 execution module.

Evaluation Highlights

Achieves 117.7 Hz control frequency on a single NVIDIA RTX 4090 GPU (with action chunking), significantly faster than autoregressive VLA baselines.
Outperforms state-of-the-art OpenVLA by 11 percentage points in success rate on real-world tasks and by 8 percentage points in simulation.
Demonstrates superior generalization to unseen objects and backgrounds compared to synchronous dual-system methods like CogACT.

Breakthrough Assessment

8/10

Significantly improves the practicality of VLA models by solving the inference speed bottleneck while maintaining reasoning capabilities, validated on real hardware.

⚙️ Technical Details

Problem Definition

Setting: Imitation learning for robotic manipulation using a dual-system VLA policy.

Inputs: Language instruction l, history of observations o_{t-1} (images, point clouds, robot state).

Outputs: Sequence of actions a_{t:t+H} (SE(3) poses and gripper state).

Pipeline Flow

Visual Encoding (SigLIP + DINOv2 + 3D Tokenizer)
System 2 Reasoning (LLaMA2 Backbone - Early Layers)
System 1 Execution (LLaMA2 Backbone - Late Layers + Diffusion Head)

System Modules

Vision Encoder

Extract visual features from 2D images and 3D point clouds

Model or implementation: SigLIP + DINOv2 (shared)

System 2 (Reasoning)

Process multimodal context to generate high-level latent instructions

Model or implementation: LLaMA2-7B (First ~28 blocks)

System 1 (Execution)

Generate high-frequency action chunks based on reasoning context and real-time state

Model or implementation: LLaMA2-7B (Final transformer blocks) + Diffusion Head

Novel Architectural Elements

Repurposing final LLM blocks as the System 1 execution module (parameter sharing) instead of an external policy head.
Embedding the diffusion denoising process directly into the LLM's embedding space.
Asynchronous sampling architecture where System 1 queries System 2's latent output periodically.
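
A minimal sketch of the parameter-sharing idea, assuming the backbone is an ordered list of blocks and the final few are reused as System 1. The 28 + 4 split follows the "first ~28 blocks" figure quoted above and is an assumption, as are the names `split_backbone` and `num_fast`; the blocks are plain strings standing in for transformer layers.

```python
from typing import List, Sequence, Tuple


def split_backbone(blocks: Sequence[str], num_fast: int) -> Tuple[List[str], List[str]]:
    """Partition an ordered block list into (leading blocks, final blocks).

    The final `num_fast` blocks play the role of System 1; the remaining
    leading blocks play the role of System 2.
    """
    if not 0 < num_fast < len(blocks):
        raise ValueError("num_fast must be between 1 and len(blocks) - 1")
    cut = len(blocks) - num_fast
    return list(blocks[:cut]), list(blocks[cut:])


# LLaMA2-7B has 32 transformer blocks. The 28 + 4 split is an assumption
# based on the summary, not a confirmed setting from the paper.
blocks = [f"block_{i}" for i in range(32)]
slow_blocks, fast_blocks = split_backbone(blocks, num_fast=4)

# Fraction of the backbone a fast step would traverse under this split.
fraction_per_fast_step = len(fast_blocks) / len(blocks)
print(len(slow_blocks), len(fast_blocks), fraction_per_fast_step)
```

Under this assumed split, each fast step passes through only one eighth of the backbone's blocks, which is the intuition behind running System 1 at a higher rate than System 2.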

Modeling

Base Model: Prismatic VLM (initialized from LLaMA2-7B + SigLIP/DINOv2)

Training Method: Dual-aware co-training (Autoregressive loss for Sys 2 + Diffusion loss for Sys 1)

Objective Functions:

Purpose: Train System 1 to generate precise actions via diffusion.

Formally: MSE loss between the predicted noise and the ground-truth noise added to actions.

Purpose: Preserve System 2's reasoning and general knowledge.

Formally: Cross-entropy loss for next-token prediction on text/discrete actions.
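
The two objectives above can be written in a standard form. This is a sketch in assumed notation rather than the paper's exact formulation: \tau is the diffusion timestep, \epsilon the sampled noise, a^{\tau}_{t:t+H} the noised action chunk, c the conditioning context (reasoning features and observations), y_i the target tokens, and \lambda a hypothetical weighting between the two terms.

```latex
\mathcal{L}_{\text{diff}}
  = \mathbb{E}_{\tau,\,\epsilon}
    \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(a^{\tau}_{t:t+H},\, \tau,\, c\right) \right\rVert_2^2 \right],
\qquad
\mathcal{L}_{\text{AR}}
  = -\sum_{i} \log p_\theta\!\left(y_i \mid y_{<i},\, c\right),
\qquad
\mathcal{L}
  = \mathcal{L}_{\text{diff}} + \lambda\, \mathcal{L}_{\text{AR}}
```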

Trainable Parameters: Full fine-tuning of LLaMA backbone and projectors (Vision encoders frozen)

Training Data:

Pretraining: >860K trajectories from Open X-Embodiment, DROID, RoboMIND
Fine-tuning: Self-collected real-world and RLBench simulation data

Key Hyperparameters:

base_model: LLaMA2-7B
diffusion_steps: 100 (T=100)
frequency_ratio: 1:4 (System 2 : System 1)
action_chunk_size: 8

Compute: Inference: 117.7 Hz on single NVIDIA RTX 4090 GPU (with action chunking)
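
A short worked calculation relating the reported control frequency to the chunk size. It assumes that the 117.7 Hz figure counts every action in a chunk as one control step, so that one model call yields `action_chunk_size` actions. The summary does not confirm this accounting, and `forward_pass_rate` is a hypothetical helper.

```python
def forward_pass_rate(control_hz: float, chunk_size: int) -> float:
    """Chunk predictions per second implied by a control rate and chunk size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return control_hz / chunk_size


# Values taken from the summary: 117.7 Hz control rate, chunk size 8.
rate = forward_pass_rate(control_hz=117.7, chunk_size=8)
latency_ms = 1000.0 / rate  # time budget per chunk prediction (assumed accounting)

print(f"{rate:.2f} chunk predictions/s, {latency_ms:.1f} ms per chunk")
# prints "14.71 chunk predictions/s, 68.0 ms per chunk"
```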

Comparison to Prior Work

vs. CogACT/PI_0: FiS-VLA embeds System 1 *inside* the VLM layers rather than appending a separate head, and uses asynchronous frequencies.
vs. OpenVLA: FiS-VLA uses diffusion for continuous action generation (System 1) instead of autoregressive token discretization.
vs. Helix [not cited in paper]: Helix also separates frequencies but runs models on separate GPUs; FiS-VLA integrates them into one model to share parameters and representations.

Limitations

Reliance on a fixed frequency ratio (e.g., 1:4) may not be optimal for all tasks.
The single-GPU implementation does not exploit the parallelism that running the two systems on separate devices could offer.
Requires high-quality 3D data (point clouds) which adds sensing complexity.

Reproducibility

Code: https://fast-in-slow.github.io

Code is available at the project website. Pretraining datasets are open-source (Open X-Embodiment, DROID). Fine-tuning data (real-world and simulation) was collected by the authors. Exact training hyperparameters (learning rate, batch size) are not fully detailed in the provided text.

📊 Experiments & Results

Evaluation Setup

Robotic manipulation in simulation (RLBench) and real-world (Franka Panda, AgileX, AlphaBot).

Benchmarks:

RLBench: simulation benchmark for manipulation (18 tasks)
Real-World Suite [New]: real-robot manipulation (picking, pouring, articulation)

Metrics:

Success Rate
Control Frequency (Hz)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Simulation results on RLBench comparing FiS-VLA against state-of-the-art baselines.
RLBench	Success Rate (%)	78	86	+8
RLBench	Success Rate (%)	58	86	+28
Real-world experiments validating performance on physical hardware.
Real-World Suite	Success Rate (%)	72	83	+11
Inference speed analysis demonstrating the efficiency of the asynchronous design.
NVIDIA 4090	Control Frequency (Hz)	5.6	117.7	+112.1
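
The Δ column can be sanity-checked with a few lines of arithmetic. This sketch copies the four rows as given (the baseline names are not listed in the table, so none are assumed here) and also derives the relative speedup implied by the control-frequency row. `recompute_delta` is a hypothetical helper.

```python
rows = [
    # (benchmark, metric, baseline, this_paper, reported_delta)
    ("RLBench", "Success Rate", 78.0, 86.0, 8.0),
    ("RLBench", "Success Rate", 58.0, 86.0, 28.0),
    ("Real-World Suite", "Success Rate", 72.0, 83.0, 11.0),
    ("NVIDIA 4090", "Control Frequency (Hz)", 5.6, 117.7, 112.1),
]


def recompute_delta(baseline: float, ours: float) -> float:
    """Absolute difference, rounded to the table's one-decimal precision."""
    return round(ours - baseline, 1)


# Rows whose recomputed difference disagrees with the reported Δ.
mismatches = [row for row in rows if recompute_delta(row[2], row[3]) != row[4]]

# Relative speedup implied by the control-frequency row.
speedup = rows[-1][3] / rows[-1][2]

print(mismatches)         # prints []
print(f"{speedup:.1f}x")  # prints 21.0x
```

The success-rate differences are absolute (percentage points), while the frequency row corresponds to roughly a 21x relative speedup.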

Experiment Figures

Conceptual comparison between previous dual-system VLAs (Separate System 1) and FiS-VLA (Unified System 1).

Main Takeaways

Integrating System 1 within the VLM backbone is more effective than attaching it externally, likely due to better feature utilization.
The asynchronous frequency design allows high-speed control (100+ Hz) without losing the reasoning benefits of a 7B-parameter model.
3D point cloud input significantly improves manipulation precision compared to using 2D images alone.
The co-training strategy successfully preserves the reasoning capabilities of the base VLM while learning precise motor control.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language-Action (VLA) models
Diffusion policies for robotic control
Transformer architecture (specifically LLaMA)
Kahneman's Dual-System Theory (System 1 vs. System 2)

Key Terms

VLA: Vision-Language-Action model—a foundation model that takes visual and text inputs and directly outputs robot actions.

System 1: In dual-process theory, the fast, intuitive, and unconscious mode of thinking; here, the high-frequency motor control module.

System 2: In dual-process theory, the slow, logical, and deliberate mode of thinking; here, the VLM reasoning about high-level tasks.

Diffusion Policy: A method for generating robot actions by gradually denoising random noise, allowing for multimodal and precise action distributions.

Action Chunking: Predicting a sequence of future actions (a chunk) at once rather than just the single next step, used to handle temporal dependencies and latency.

Asynchronous Frequency: Running different parts of the model at different speeds; System 2 updates context slowly, while System 1 generates actions quickly.

Proprioception: The robot's internal sense of its own joint positions and velocities.

SE(3): Special Euclidean group in 3D—representing position (x, y, z) and orientation (rotation) of the robot end-effector.
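
As a small illustration of the SE(3) definition, the sketch below applies a rigid transform (a rotation R followed by a translation t) to a 3D point. It is generic geometry rather than anything specific to FiS-VLA, and `rot_z` and `apply_pose` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Vector = Tuple[float, float, float]
Matrix = List[List[float]]


def rot_z(angle_rad: float) -> Matrix:
    """3x3 rotation matrix for a rotation about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_pose(rotation: Matrix, translation: Vector, point: Vector) -> Vector:
    """Map a point through the rigid transform p -> R p + t."""
    rotated = [sum(rotation[i][j] * point[j] for j in range(3)) for i in range(3)]
    return (
        rotated[0] + translation[0],
        rotated[1] + translation[1],
        rotated[2] + translation[2],
    )


# Rotate the point (1, 0, 0) by 90 degrees about z, then shift by (1, 0, 0).
moved = apply_pose(rot_z(math.pi / 2), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print([round(v, 6) for v in moved])
```

The rotated point lands on the y-axis and the translation then shifts it along x, giving approximately (1, 1, 0).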