VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

📝 Paper Summary

Language-conditioned robotic manipulation · Zero-shot trajectory synthesis · Vision-Language Models (VLMs) for robotics

VoxPoser enables robots to perform open-ended manipulation tasks by using LLMs to write code that composes 3D cost maps, which then guide motion planners without requiring task-specific training data.

Core Problem

Existing methods for language-conditioned robot manipulation typically rely on pre-defined motion primitives or large-scale labeled robotic data, which bottlenecks generalization to new tasks and objects.

Why it matters:

Pre-defined primitives limit the diversity of fine-grained actions a robot can perform.
Collecting large-scale robotic data annotated with language instructions is expensive and laborious.
Directly outputting high-frequency control signals from LLMs is impractical due to high dimensionality.

Concrete Example: Given the instruction 'open the top drawer and watch out for the vase', standard methods might fail if they lack a specific 'avoid_vase' primitive. VoxPoser generates a 3D map where the drawer handle has high value (attraction) and the vase's vicinity has low value (repulsion), guiding the planner naturally.

Key Novelty

LLM-synthesized 3D Value Maps for Planning

Uses LLMs (like GPT-4) to generate Python code that calls VLM APIs (like OWL-ViT) to locate objects.
The generated code composes dense 3D voxel grids (value maps) representing affordances (where to go) and constraints (what to avoid) in observation space.
These value maps serve as objective functions for a standard model-based motion planner (MPC), enabling zero-shot execution without training.
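The composition idea above can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' generated code: the grid size, the voxel indices for the handle and vase, and the helpers `distance_to` and `compose_value_map` are all hypothetical, and the "high value attracts, low value repels" convention follows the drawer-and-vase example.

```python
import numpy as np

def distance_to(shape, voxel):
    """Euclidean distance (in voxels) from every grid cell to `voxel`."""
    grid = np.indices(shape).astype(float)  # shape (3, X, Y, Z)
    return np.linalg.norm(grid - np.reshape(voxel, (3, 1, 1, 1)), axis=0)

def compose_value_map(shape, target, avoid, avoid_radius=4.0, avoid_weight=10.0):
    """High value near `target` (affordance), low value near `avoid` (constraint)."""
    d_target = distance_to(shape, target)
    affordance = 1.0 - d_target / d_target.max()  # equals 1 at the target
    # Penalty is 1 at the obstacle and fades to 0 at `avoid_radius` voxels.
    penalty = np.clip(1.0 - distance_to(shape, avoid) / avoid_radius, 0.0, 1.0)
    return affordance - avoid_weight * penalty

shape = (20, 20, 20)
handle = (15, 10, 10)  # hypothetical voxel index of the drawer handle
vase = (8, 10, 10)     # hypothetical voxel index of the vase

value_map = compose_value_map(shape, handle, vase)
best = tuple(int(i) for i in np.unravel_index(np.argmax(value_map), shape))
```

Because the two terms are simply added, further constraints (for example a second obstacle) could be composed by subtracting more penalty maps, which is the property the summary refers to as composability.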

Architecture

The complete VoxPoser pipeline from instruction to motion planning.

Evaluation Highlights

Achieves 88.0% success rate on everyday real-world manipulation tasks, compared to 24.0% for Code as Policies (LLM + Primitives).
Demonstrates 70% success rate under dynamic disturbances in the real world (0% for baselines).
Outperforms learned cost-map baselines (U-Net) by large margins in simulation on unseen instructions (76.7% vs 0.0% for composition tasks).

Breakthrough Assessment

9/10

A significant step forward in zero-shot robotic generalization. By bridging LLM reasoning with low-level planning via value maps, it avoids hand-designed motion primitives and task-specific training data, addressing a major bottleneck in embodied AI.

⚙️ Technical Details

Problem Definition

Setting: Open-set robotic manipulation given free-form natural language instructions and RGB-D observations.

Inputs: Language instruction L and sequence of RGB-D observations

Outputs: Dense sequence of 6-DoF end-effector waypoints (robot trajectory)

Pipeline Flow

User Instruction -> Planner LMP -> Sub-tasks
Sub-task + Observation -> Composer LMP -> Python Code
Python Code -> Perception APIs (OWL-ViT + SAM) -> Object Masks/Points
Python Code + Object Points -> NumPy Operations -> 3D Value Maps
3D Value Maps -> MPC Motion Planner -> Robot Trajectory
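The data flow above can be sketched end to end with stubbed stages. Everything here is a stand-in chosen for illustration: the `fake_*` functions replace the real LLM, perception, and planner calls, and the object positions are invented. The one point the sketch tries to show is that the composer returns Python source, which is then executed to produce a value map.

```python
import numpy as np

GRID_SHAPE = (10, 10, 10)

def fake_planner_llm(instruction):
    """Stand-in for the Planner LMP: split an instruction into sub-tasks."""
    return [part.strip() for part in instruction.split(" then ")]

def fake_composer_llm(sub_task):
    """Stand-in for the Composer LMP: return Python source for one sub-task."""
    target = sub_task.split()[-1]  # toy rule: the last word names the object
    return (
        f"pos = detect('{target}')\n"
        "value_map = np.zeros(GRID_SHAPE)\n"
        "value_map[pos] = 1.0\n"
    )

def fake_detect(name):
    """Stand-in for the perception APIs: object name -> voxel index."""
    return {"cup": (2, 3, 4), "tray": (7, 7, 1)}[name]

def fake_motion_planner(value_map):
    """Stand-in for the motion planner: return the highest-value voxel."""
    index = np.unravel_index(np.argmax(value_map), value_map.shape)
    return tuple(int(i) for i in index)

def run(instruction):
    waypoints = []
    for sub_task in fake_planner_llm(instruction):
        scope = {"np": np, "detect": fake_detect, "GRID_SHAPE": GRID_SHAPE}
        exec(fake_composer_llm(sub_task), scope)  # run the generated code
        waypoints.append(fake_motion_planner(scope["value_map"]))
    return waypoints

waypoints = run("reach the cup then move to the tray")
```

Executing model-generated code with `exec` is unsafe outside a sandbox; it is used here only to keep the sketch short.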

System Modules

Planner LMP

Decomposes high-level instructions into sequential sub-tasks

Model or implementation: GPT-4

Composer LMP

Writes Python code to invoke perception and generate value maps for a specific sub-task

Model or implementation: GPT-4

Perception Module

Grounds language queries to 3D spatial information

Model or implementation: OWL-ViT + SAM + XMEM
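One common way to turn a 2D segmentation mask into 3D spatial information is to back-project the masked depth pixels through a pinhole camera model. The sketch below assumes that approach; the summary does not state how the paper performs this step, and the intrinsics (`fx`, `fy`, `cx`, `cy`), the synthetic depth image, and the helper `mask_to_points` are hypothetical.

```python
import numpy as np

def mask_to_points(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to camera-frame 3D points, shape (N, 3)."""
    v, u = np.nonzero(mask & (depth > 0))  # rows and columns with valid depth
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

depth = np.full((4, 4), 2.0)         # synthetic depth image: 2 m everywhere
mask = np.zeros((4, 4), dtype=bool)  # synthetic object mask
mask[1:3, 1:3] = True

points = mask_to_points(depth, mask, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
centroid = points.mean(axis=0)       # a single 3D location for the object
```

The resulting points are in the camera frame; a further rigid transform (camera extrinsics) would be needed to express them in the robot's workspace.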

Motion Planner

Synthesizes trajectory by optimizing cost function defined by value maps

Model or implementation: Model Predictive Control (MPC) with zeroth-order optimization
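Zeroth-order optimization means the planner only evaluates the objective, without gradients. Random shooting is one simple instance: sample candidate moves, score each against the map, keep the best. The sketch below is an assumption-laden illustration rather than the paper's planner; the function `random_shooting_step`, the sample count, and the distance-based cost map are all hypothetical.

```python
import numpy as np

def random_shooting_step(cost_map, position, rng, num_samples=256, max_step=2):
    """Sample candidate moves and return the one with the lowest map cost."""
    steps = rng.integers(-max_step, max_step + 1, size=(num_samples, 3))
    upper = np.array(cost_map.shape) - 1
    candidates = np.clip(position + steps, 0, upper)
    candidates = np.vstack([candidates, position])  # staying put is allowed
    costs = cost_map[candidates[:, 0], candidates[:, 1], candidates[:, 2]]
    return candidates[np.argmin(costs)]

shape = (16, 16, 16)
target = np.array([12, 5, 9])  # hypothetical goal voxel
grid = np.indices(shape).astype(float)
cost_map = np.linalg.norm(grid - target.reshape(3, 1, 1, 1), axis=0)

rng = np.random.default_rng(0)
start = np.array([0, 0, 0])
position = start.copy()
for _ in range(30):
    position = random_shooting_step(cost_map, position, rng)

final_cost = cost_map[tuple(position)]
```

Because the current position is always among the candidates, the cost never increases from one step to the next; how close the result gets to the target depends on the number of samples and iterations.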

Novel Architectural Elements

Usage of LLM-generated code to construct dense 3D voxel maps (Value Maps) rather than calling motion primitives
Integration of LLM-defined cost maps directly into a low-level MPC optimization loop
Decoupling of semantic reasoning (LLM) from physical grounding (VLM + Planner) via a code interface

Modeling

Base Model: GPT-4 (for code generation)

Training Method: Zero-shot prompting (no training involved for the main framework)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Code as Policies: VoxPoser composes dense 3D spatial maps for continuous planning instead of invoking discrete, pre-defined primitives.
vs. LANI: VoxPoser is zero-shot and uses LLMs/VLMs directly, whereas LANI requires training data to learn the language-to-cost mapping.
vs. Inner Monologue: VoxPoser handles low-level motion synthesis (dense waypoints), whereas Inner Monologue focuses on high-level task sequencing.

Limitations

Relies heavily on the quality of external perception models (OWL-ViT, SAM); detection failures lead to task failure.
Contact-rich tasks require a dynamics model. One can be learned efficiently, but most tasks currently rely on simple heuristic dynamics.
Prompt engineering is required to ensure the LLM generates correct NumPy code.
Inference speed is limited by LLM API latency and the perception pipeline (replanning at 5 Hz is achieved, but requires an efficient implementation).

Reproducibility

Code: https://voxposer.github.io

The project page also hosts videos. The framework uses the OpenAI API (GPT-4), which is closed-source, and relies on OWL-ViT, SAM, and XMEM, which are open-source. Prompts are provided in the paper's Appendix.

📊 Experiments & Results

Evaluation Setup

Real-world robotic manipulation (Franka Emika Panda) and a simulated block-world (SAPIEN).

Benchmarks:

Real-World Everyday Tasks (5 tasks: Move & Avoid, Set Up Table, Close Drawer, Open Bottle, Sweep Trash) [New]
Simulated Block-World (13 randomizable tasks with 2766 unique instructions) [New]

Metrics:

Success Rate (%)
Statistical methodology: Reported mean and standard deviation for dynamics learning experiments. Success rates reported as percentages over 10 or 20 trials.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Real-world experiments comparing VoxPoser against 'LLM + Primitives' (Code as Policies) in static and dynamic environments:
Real-World Everyday Tasks (Average, static)	Success Rate (%)	24.0	88.0	+64.0
Real-World Everyday Tasks (Average, with disturbances)	Success Rate (%)	0.0	70.0	+70.0
Simulation results testing generalization to unseen instructions and attributes against baselines:
Simulated Block-World (Spatial Composition)	Success Rate (%)	3.8	58.8	+55.0
Simulated Block-World (Spatial Composition)	Success Rate (%)	25.0	76.7	+51.7

Experiment Figures

Qualitative visualization of 3D value maps and robot execution for various tasks.

Breakdown of error sources (Perception, Dynamics, Specification) for different methods.

Main Takeaways

Zero-shot generalization: VoxPoser handles open-set instructions and objects significantly better than baselines that require training or pre-defined primitives.
Robustness: Closed-loop replanning at 5 Hz allows the system to adapt to dynamic changes and disturbances in real time.
Efficient Learning: For contact-rich tasks (e.g., opening a door) where zero-shot physics is insufficient, VoxPoser trajectories serve as effective exploration priors, enabling dynamics learning in <3 minutes (vs. >12 hours for random exploration).
Error Analysis: Most failures stem from perception (detection/segmentation) rather than the LLM's reasoning or the planner's ability to optimize the map.
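The robustness takeaway rests on closed-loop replanning: the map is rebuilt from fresh observations every control cycle, so a moved object simply changes the next map. The toy loop below sketches that behaviour under stated assumptions; the greedy neighbour search stands in for the real planner, and the grid, targets, and the cycle at which the disturbance occurs are invented.

```python
import numpy as np

def distance_map(shape, target):
    """Cost map: Euclidean distance (in voxels) from every cell to `target`."""
    grid = np.indices(shape).astype(float)
    return np.linalg.norm(grid - np.reshape(target, (3, 1, 1, 1)), axis=0)

def greedy_step(cost_map, position):
    """Move to the lowest-cost voxel among the current one and its neighbours."""
    best = position
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                cand = (position[0] + dx, position[1] + dy, position[2] + dz)
                in_bounds = all(0 <= c < s for c, s in zip(cand, cost_map.shape))
                if in_bounds and cost_map[cand] < cost_map[best]:
                    best = cand
    return best

shape = (16, 16, 16)
position = (0, 0, 0)
target = (12, 3, 5)             # hypothetical initial object location
for cycle in range(40):         # each iteration is one replanning cycle
    if cycle == 5:
        target = (2, 14, 9)     # disturbance: the object is moved mid-task
    cost_map = distance_map(shape, target)  # rebuild the map from the new observation
    position = greedy_step(cost_map, position)
```

An open-loop plan computed once at cycle 0 would have ended at the original location; replanning each cycle lets the end-effector follow the object to its new position.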

📚 Prerequisite Knowledge

Prerequisites

Model Predictive Control (MPC)
Vision-Language Models (open-vocabulary detection)
Prompt engineering for Code Generation
Coordinate transforms (voxel grids to robot space)
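For the last prerequisite, the conversion between metric workspace coordinates and voxel indices is a linear rescaling. The sketch below assumes an axis-aligned workspace box with a cubic grid; the bounds, grid size, and function names are hypothetical and not taken from the paper.

```python
import numpy as np

def world_to_voxel(point, bounds_min, bounds_max, grid_size):
    """Map a metric 3D point to integer voxel indices, clipped to the grid."""
    point = np.asarray(point, dtype=float)
    lo = np.asarray(bounds_min, dtype=float)
    hi = np.asarray(bounds_max, dtype=float)
    frac = (point - lo) / (hi - lo)  # 0..1 inside the workspace
    idx = np.floor(frac * grid_size).astype(int)
    return np.clip(idx, 0, grid_size - 1)

def voxel_to_world(index, bounds_min, bounds_max, grid_size):
    """Map voxel indices back to the metric centre of that voxel."""
    lo = np.asarray(bounds_min, dtype=float)
    hi = np.asarray(bounds_max, dtype=float)
    voxel_size = (hi - lo) / grid_size
    return lo + (np.asarray(index) + 0.5) * voxel_size

bounds_min = (-0.5, -0.5, 0.0)  # hypothetical workspace box, in metres
bounds_max = (0.5, 0.5, 1.0)
point = np.array([0.0, 0.25, 0.5])

idx = world_to_voxel(point, bounds_min, bounds_max, grid_size=100)
centre = voxel_to_world(idx, bounds_min, bounds_max, grid_size=100)
```

The round trip loses at most half a voxel per axis (5 mm here), which bounds the positional resolution any planner working on the grid can achieve.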

Key Terms

Value Map: A 3D voxel grid where each voxel contains a scalar value representing the cost or reward of the robot's end-effector being at that location.

Affordance Map: A type of value map indicating regions the robot should interact with or move towards (e.g., a handle).

Constraint Map: A type of value map indicating regions the robot should avoid (e.g., an obstacle like a vase).

MPC: Model Predictive Control—an optimal control method that optimizes a trajectory over a finite time horizon using a dynamic model.

LMP: Language Model Program—a modular prompting structure where LLMs generate code to solve sub-tasks, recursively calling other LMPs.

OWL-ViT: Open-World Localization Vision Transformer—an open-vocabulary object detection model.

SAM: Segment Anything Model—a model that can generate segmentation masks for objects given prompts like bounding boxes.

Voxel: A volume element; essentially a 3D pixel representing a point in a 3D grid.

Zero-shot: The ability to perform a task without having explicitly trained on examples of that specific task.

6-DoF: Six Degrees of Freedom—referring to position (x, y, z) and orientation (roll, pitch, yaw).