HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation

📝 Paper Summary

Robotic Manipulation Vision-Language-Action (VLA) Models

HAMSTER decouples manipulation into high-level VLM-based 2D path planning trained on cheap off-domain data and low-level 3D policies that execute these paths with high-frequency control.

Core Problem

Monolithic VLA models require expensive, scarce on-robot data and struggle with dexterity due to low inference frequency, while small specialist policies fail to generalize to new semantic instructions or visual variations.

Why it matters:

Collecting end-to-end on-robot data (observation-action pairs) is prohibitively expensive and currently insufficient for open-world generalization
Monolithic models cannot easily leverage abundant 'off-domain' data (simulations, human videos) because they require precise robot actions which are absent or mismatched in such data
Existing small policies are brittle to drastic environmental changes, limiting their utility in diverse real-world scenarios

Concrete Example: A standard robot policy trained on tabletop data might fail if the table color changes or the object description is semantically complex (e.g., 'the toy that looks like a cat'), whereas a VLM can understand the semantics but lacks the 3D precision to grasp it directly.

Key Novelty

Hierarchical Action Models with SeparaTEd Path Representations (HAMSTER)

Decomposes the control loop: A large VLM predicts a coarse 2D path (what/how to manipulate) from an image, and a small policy executes it using 3D observations
Intermediate Representation: Uses '2D paths' (end-effector trajectory + gripper state) as the interface, which can be extracted from cheap sources like simulation or videos without needing robot actions
Enables finetuning the high-level VLM on massive 'off-domain' datasets (simulation, pixel-tracking tasks, other robot embodiments) to bridge the sim-to-real gap

Architecture

Conceptual diagram of the HAMSTER hierarchical architecture

Evaluation Highlights

Improves success rate over OpenVLA by an average of 20% across seven axes of generalization (embodiment, dynamics, visual appearance, etc.)
This improvement corresponds to a 50% relative gain in success rate over the OpenVLA baseline in real-robot experiments
Demonstrates effective transfer from off-domain training data (simulation, diverse robot videos) to real-world deployment without seeing the test environment during VLM training
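
The two headline figures can be related with a short worked calculation. This is only a sketch: it assumes the 20% figure is an absolute difference in average success rate and that both figures refer to the same averages. The symbols b (OpenVLA average) and h (HAMSTER average) are introduced here for illustration, and the implied values are not numbers reported in this summary.

```latex
% Assumption: h - b = 0.20 (absolute gain) and (h - b)/b = 0.50 (relative gain)
\begin{aligned}
\frac{h - b}{b} = 0.50 \;\Rightarrow\; b &= \frac{h - b}{0.50} = \frac{0.20}{0.50} = 0.40, \\
h &= b + 0.20 = 0.60 .
\end{aligned}
```

Under these assumptions the averages would be roughly 40% for the baseline and 60% for HAMSTER; exact per-task values should be taken from the paper itself.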

Breakthrough Assessment

8/10

Significantly addresses the data scarcity bottleneck in robotics by enabling VLAs to learn from abundant off-domain data (sim/video) via a hierarchical 2D path interface, showing strong real-world generalization.

⚙️ Technical Details

Problem Definition

Setting: Open-world robot manipulation given visual observations and language instructions

Inputs: RGB image 'img', language instruction 'z', proprioceptive state 's', optional depth/pointcloud 'o'

Outputs: Robot control actions 'a' (e.g., end-effector pose changes)

Pipeline Flow

Input Processing: Image + Language Instruction
High-Level Planning: VLM predicts 2D path
Path Processing: Simplification and Overlay
Low-Level Control: Policy generates 3D actions
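
The four stages above can be sketched as a control loop. This is a minimal illustration, not the authors' implementation: every callable (get_obs, plan_path, simplify, overlay, policy, send_action) is a hypothetical stand-in, and the sketch assumes the VLM is queried once at the start of an episode while the low-level policy runs at every step.

```python
from typing import Any, Callable, Dict, List, Tuple

# (x, y, gripper_open); x and y are assumed to be normalized to [0, 1].
PathPoint = Tuple[float, float, bool]


def run_episode(
    get_obs: Callable[[], Dict[str, Any]],
    plan_path: Callable[[Any, str], List[PathPoint]],
    simplify: Callable[[List[PathPoint]], List[PathPoint]],
    overlay: Callable[[Any, List[PathPoint]], Any],
    policy: Callable[[Any, Dict[str, Any]], Any],
    send_action: Callable[[Any], None],
    num_steps: int,
) -> List[PathPoint]:
    """Run one episode of a two-level (planner + controller) loop."""
    obs = get_obs()
    # High-level planning: one slow VLM call (assumption: no replanning).
    path = simplify(plan_path(obs["image"], obs["instruction"]))
    for _ in range(num_steps):
        # Path processing: draw the path onto the current image.
        guided_image = overlay(obs["image"], path)
        # Low-level control: fast policy call on every step.
        action = policy(guided_image, obs)
        send_action(action)
        obs = get_obs()
    return path
```

Keeping the VLM call outside the loop reflects the motivation stated earlier: the large model is too slow for high-frequency control, so only the small policy sits on the per-step path.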

System Modules

High-Level VLM

Predicts a coarse 2D path and gripper state changes conditioned on the image and instruction

Model or implementation: VILA-1.5-13b (finetuned)

Path Simplifier

Reduces the number of points in the predicted path while retaining its high-level guidance structure

Model or implementation: Ramer-Douglas-Peucker algorithm
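
For readers unfamiliar with the algorithm, a minimal Ramer-Douglas-Peucker sketch on (x, y) points follows. The function names and the epsilon tolerance are illustrative choices, not values from the paper, and the sketch ignores the gripper state attached to each path point.

```python
import math


def _perpendicular_distance(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        # Degenerate chord: fall back to the distance to the single endpoint.
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / norm


def rdp(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, max_idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _perpendicular_distance(points[i], first, last)
        if d > max_dist:
            max_dist, max_idx = d, i
    if max_dist <= epsilon:
        # Every interior point is close to the chord: keep only the endpoints.
        return [first, last]
    # Otherwise split at the farthest point and simplify both halves.
    left = rdp(points[: max_idx + 1], epsilon)
    right = rdp(points[max_idx:], epsilon)
    return left[:-1] + right  # drop the duplicated split point
```

A larger epsilon yields a coarser path; with normalized coordinates, epsilon is expressed as a fraction of the image size.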

Low-Level Policy

Generates precise robot actions using the 2D path as a visual guide alongside 3D observations

Model or implementation: RVT-2 or 3D-DA (3D Diffuser Actor)

Novel Architectural Elements

Hierarchical decoupling where the interface is strictly a '2D path' (normalized pixel coordinates + gripper) rather than language or sub-goals
Visual conditioning mechanism: The predicted 2D path is drawn/overlaid directly onto the image input of the low-level policy
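
The overlay idea can be illustrated with a small NumPy sketch that rasterizes a normalized 2D path onto an RGB image. The rendering choices here (a blue-to-red color ramp over time, green squares where the gripper state changes, the function names, and the (w - 1) pixel scaling) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def _stamp(img, cx, cy, color, radius):
    """Paint a small square centered at (cx, cy), clipped to the image."""
    h, w = img.shape[:2]
    x, y = int(round(cx)), int(round(cy))
    img[max(y - radius, 0): min(y + radius + 1, h),
        max(x - radius, 0): min(x + radius + 1, w)] = color


def overlay_path(image, path, radius=1):
    """Draw a path of (x, y, gripper_open) points, x and y in [0, 1].

    image is an HxWx3 uint8 array; a modified copy is returned.
    """
    out = image.copy()
    h, w = out.shape[:2]
    pix = [(x * (w - 1), y * (h - 1), g) for x, y, g in path]
    n_seg = max(len(pix) - 1, 1)
    for i in range(len(pix) - 1):
        x0, y0, _ = pix[i]
        x1, y1, _ = pix[i + 1]
        frac = i / n_seg  # earlier segments are blue, later ones red
        color = (int(255 * frac), 0, int(255 * (1 - frac)))
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 2
        for t in np.linspace(0.0, 1.0, n):
            _stamp(out, x0 + t * (x1 - x0), y0 + t * (y1 - y0), color, radius)
    # Mark points where the gripper state differs from the previous point.
    for i in range(1, len(pix)):
        if pix[i][2] != pix[i - 1][2]:
            _stamp(out, pix[i][0], pix[i][1], (0, 255, 0), radius + 1)
    return out
```

Because the path lives in the image itself, the low-level policy needs no new input modality: it receives an ordinary image that happens to carry the plan.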

Modeling

Base Model: VILA-1.5-13b (High-level VLM)

Training Method: Supervised Fine-Tuning (SFT) for VLM; Behavioral Cloning (BC) for Low-level Policy

Objective Functions:

Purpose: Finetune VLM to predict path coordinates.

Formally: Maximize log likelihood of answer tokens (path points) given image and prompt.

Purpose: Train low-level policy to mimic expert actions.

Formally: Maximize log likelihood of expert action a_i given state, observation, instruction, and projected 2D path p_i.
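
The two objectives can be written out as follows. This is a notational sketch: the parameter symbols θ and φ, the token sequence y_1..y_T, and the dataset names D_off and D_robot are assumptions introduced here; img, z, s, o, a_i, and p_i follow the definitions used in this summary.

```latex
% High-level VLM (supervised fine-tuning on path-answer tokens)
\max_{\theta} \;
  \mathbb{E}_{(\mathrm{img},\, z,\, y) \sim \mathcal{D}_{\mathrm{off}}}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid y_{<t},\, \mathrm{img},\, z\right) \right]

% Low-level policy (behavioral cloning conditioned on the projected 2D path)
\max_{\phi} \;
  \mathbb{E}_{(s_i,\, o_i,\, z,\, p_i,\, a_i) \sim \mathcal{D}_{\mathrm{robot}}}
  \left[ \log \pi_{\phi}\!\left(a_i \mid s_i,\, o_i,\, z,\, p_i\right) \right]
```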

Training Data:

VLM Training Data (Off-domain):
1. RoboPoint: 770k pixel point prediction tasks (Locate object)
2. RLBench (Sim): 320k tuples from 1000 episodes x 81 tasks (trajectory projection)
3. Real Robot (Off-domain): 10k trajectories from Bridge (WidowX), 45k from DROID (Frankas)
4. Co-training: 660k general VQA samples
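
The "trajectory projection" used to turn simulated or recorded end-effector trajectories into 2D path labels can be sketched with a standard pinhole camera model. This is an illustration under stated assumptions: the function name, the intrinsics matrix K, the world-to-camera transform T_cam_world, and the normalization by (width - 1) and (height - 1) are all hypothetical choices, not details from the paper.

```python
import numpy as np


def project_to_2d_path(points_world, gripper_open, K, T_cam_world, width, height):
    """Project 3D end-effector positions to normalized (x, y, gripper) points.

    points_world: (N, 3) positions in the world frame.
    gripper_open: length-N sequence of booleans.
    K: (3, 3) camera intrinsics; T_cam_world: (4, 4) world-to-camera transform.
    """
    pts = np.asarray(points_world, dtype=float)
    homo = np.hstack([pts, np.ones((len(pts), 1))])      # (N, 4) homogeneous
    cam = (np.asarray(T_cam_world) @ homo.T).T[:, :3]    # camera-frame XYZ
    if np.any(cam[:, 2] <= 0):
        raise ValueError("point behind the camera cannot be projected")
    uvw = (np.asarray(K) @ cam.T).T
    u = uvw[:, 0] / uvw[:, 2]                            # pixel column
    v = uvw[:, 1] / uvw[:, 2]                            # pixel row
    return [
        (float(ui / (width - 1)), float(vi / (height - 1)), bool(g))
        for ui, vi, g in zip(u, v, gripper_open)
    ]
```

No robot action labels are needed for this step, only end-effector positions and camera calibration, which is why such labels are cheap to produce from simulation.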

Compute: Not reported in the paper

Comparison to Prior Work

vs. OpenVLA: HAMSTER separates planning (VLM) and control (Policy), enabling use of action-free data and 3D inputs
vs. LLARVA: HAMSTER uses the path as an explicit input to a separate low-level controller, not just an aux loss
vs. RT-Trajectory: HAMSTER automates trajectory generation via a finetuned VLM rather than relying on humans
vs. Traditional TAMP [not cited in paper]: HAMSTER uses learned visual representations and web-scale pretraining rather than geometric planning on known models

Limitations

Dependency on the quality of 2D path predictions; if the VLM fails to identify the object or path, the low-level policy fails
Low-level policies must still be trained on in-domain robot data (though less of it)
2D paths might lose information compared to 3D trajectories, though they are easier to extract from video
Inference latency of the large VLM (13B parameters) limits the frequency of high-level replanning

Reproducibility

Project page available at https://hamster-robot.github.io/, with visual results provided. The paper describes the code as a 'fully open-sourced enabler', but no explicit repository URL is listed in the text. Training datasets (RoboPoint, RLBench, Bridge, DROID) are public/standard.

📊 Experiments & Results

Evaluation Setup

Real-world robot manipulation across diverse scenarios to test generalization

Benchmarks:

Real-Robot Generalization Suite: tabletop manipulation across 7 axes of generalization [New]

Metrics:

Success Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

No per-benchmark table (Benchmark, Metric, Baseline, This Paper, Δ) could be reconstructed: raw result tables (e.g., exact success % per task) were not provided in the snippet, only aggregated relative gains.

Experiment Figures

Visualization of the 2D path representation

Examples of the diverse off-domain training data for the VLM

Main Takeaways

HAMSTER outperforms OpenVLA by an average of 20% in success rate across 7 generalization axes (embodiment, dynamics, visual appearance, semantics, etc.), representing a 50% relative gain.
VLM finetuning on off-domain data (sim, videos) is crucial; pre-trained VLMs struggle to predict valid manipulation paths zero-shot.
The hierarchical design allows the low-level policy to be robust to visual and semantic variations by relying on the VLM's path guidance.
Overlaying 2D paths onto images is an effective conditioning mechanism for low-level policies, enabling them to focus on local geometry rather than high-level planning.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) and their finetuning
Imitation Learning / Behavioral Cloning
Robot Coordinate Systems (2D image plane vs. 3D workspace)
Sim-to-Real Transfer

Key Terms

VLA: Vision-Language-Action models—models that take vision and language as input and directly output robot actions

VLM: Vision-Language Model—a large transformer model trained on text and images to generate text (or tokens)

2D Path: A sequence of normalized 2D coordinates [(x, y, gripper)] on the image plane representing the desired end-effector trajectory

Proprioception: The robot's internal sense of its own joint positions and gripper state

Off-domain data: Data collected from sources different from the test environment, such as simulation, videos of humans, or different robot bodies

Sim-to-real: The challenge of transferring policies learned in physics simulation to the real physical world despite differences in visuals and physics

RVT-2: A specific 3D-aware robot policy architecture (Robotic View Transformer) that uses multi-view 3D representations

Ramer-Douglas-Peucker: An algorithm used to simplify a curve composed of line segments into a similar curve with fewer points