LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

📝 Paper Summary

Vision-Language-Action (VLA) Models Robot Learning Instruction Tuning

LLaRA adapts pretrained Vision-Language Models for robot control by converting behavior cloning trajectories into visual conversation data and enhancing them with self-supervised auxiliary tasks like spatial reasoning.

Core Problem

Adapting pretrained Vision-Language Models (VLMs) to robotic control is difficult due to data scarcity and the models' lack of precise spatial awareness required for manipulation.

Why it matters:

Directly transferring VLMs to robotics often fails because standard vision-language data lacks the spatial precision needed for low-level control
Curating high-quality conversation-style data for robotics is non-trivial and hard to scale for new domains compared to standard computer vision tasks
Current approaches often rely on specialized tokens or architectures, limiting the efficient transfer of generalist VLM knowledge to robotic agents

Concrete Example: A standard VLM trained on image captions might describe a mug's color but fail to output the precise pixel coordinates needed for a robot gripper to grasp the handle, as it lacks fine-grained spatial grounding.

Key Novelty

Visuomotor Instruction Tuning with Self-Supervised Auxiliary Data

Converts standard robot behavior cloning data (state-action pairs) into text-based conversations where actions are represented as normalized image coordinates
Generates six types of auxiliary instruction-tuning datasets (e.g., spatial relationships, future prediction) from existing trajectories without requiring new human annotations
Uses a Description-Instruct-BC pipeline to handle multiple-image observations by converting reference images into textual descriptions via object detection

Architecture

The LLaVA model architecture adapted for LLaRA, showing the flow from image/text inputs to language output.

Breakthrough Assessment

7/10

Proposes a clever, scalable data generation pipeline that bridges the gap between general VLMs and specific robot policies without architectural changes, though the core innovation is primarily data-centric.

⚙️ Technical Details

Problem Definition

Setting: Behavioral Cloning (BC) over a Markov Decision Process (MDP)

Inputs: Current visual observation image, textual task description, and history of past actions

Outputs: Predicted future actions represented as text tokens (normalized 2D image coordinates and rotation angles)

Pipeline Flow

Input Processing (Image + Text Instruction)
Visual Encoding (CLIP)
Modal Alignment (MLP Projection)
Language Modeling (LLM Autoregression)
Action Decoding (Text to Robot Commands)

System Modules

Vision Encoder

Encodes the visual observation into feature embeddings

Model or implementation: CLIP-ViT-L-336px (frozen or fine-tuned not explicitly specified in snippet, implied LLaVA standard)

Adapter

Projects visual tokens into the language embedding space

Model or implementation: MLP (Multi-Layer Perceptron)

LLM

Generates the action response based on visual and textual context

Model or implementation: Vicuna-1.5-7B (based on LLaVA-1.5)

Novel Architectural Elements

Description-Instruct-BC Module: An input processing step that uses an off-the-shelf object detector to convert secondary reference images into textual descriptions, allowing a single-image VLM to handle multi-image tasks

Modeling

Base Model: LLaVA-1.5 (Vicuna-1.5-7B LLM + CLIP-ViT-L-336px Vision Encoder)

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Train the model to predict the next text token (action coordinates) given the image and instruction.

Formally: Standard autoregressive next-token prediction loss.

Adaptation: Fine-tuning of LLM and MLP adapter weights

Training Data:

Instruct-BC: Converted behavior cloning trajectories
Description-Instruct-BC: Augmented data where reference images are converted to text descriptions
Auxiliary Datasets: 6 self-supervised tasks (Localization, Detection, Action Prediction, Future Prediction, Spatial Relationships, Temporal Relationships)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RT-2/OpenVLA: LLaRA uses natural language text coordinates normalized to [0,1] instead of specialized learned action tokens
vs. RoboPoint: LLaRA predicts actions directly in the instruction-response format without needing depth sensors for the policy output itself
vs. SpatialVLM: LLaRA integrates spatial tasks as auxiliary supervision for a control policy, whereas SpatialVLM is primarily a VQA model

Limitations

Relies on the quality of the base VLM (LLaVA) and its pretrained visual encoder
Single-image architecture requires object detection workaround for multi-image tasks
Auxiliary data generation depends on the accuracy of automated labeling (e.g., existing object detectors or heuristic scripts)

Reproducibility

Code: https://github.com/LostXine/LLaRA

Code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA. The paper describes templates for data generation in detail (Fig 4 and Appendix).

📊 Experiments & Results

Evaluation Setup

Simulated and real-world robot manipulation tasks evaluating policy success rates

Benchmarks:

Simulated Environments (Robot manipulation (specific benchmark names like LIBERO implied but not explicitly in text))
Real-world Experiments (Robot manipulation)

Metrics:

Success Rate
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Examples of the Instruct-BC and Description-Instruct-BC data formats.

Examples of the six auxiliary self-supervised tasks used to supercharge the dataset.

Main Takeaways

Visuomotor Instruction Tuning effectively aligns pretrained VLMs with robotic control by treating actions as text-based coordinate responses.
Self-supervised auxiliary datasets (localization, future prediction, etc.) generated from existing trajectories significantly improve policy performance without new human labels.
Describing reference images via text (Description-Instruct-BC) allows single-image VLMs to function effectively in tasks requiring multi-image context.
Note: Specific quantitative results (tables/metrics) were not contained in the provided text snippet.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) like LLaVA
Basics of Robot Learning and Behavior Cloning (BC)
Familiarity with Instruction Tuning

Key Terms

VLA: Vision-Language-Action models—VLMs adapted to generate robotic actions directly

Visuomotor Instruction Tuning: The process of fine-tuning a VLM on visual data where the 'answer' is a specific robot action described in text

Behavior Cloning: A method where a robot learns a policy by mimicking expert demonstrations (state-action pairs)

Self-Supervised Learning: Learning from data without manual labels, here used to generate auxiliary tasks like 'predict the next state' from video trajectories

LLaVA: Large Language and Vision Assistant—a specific VLM architecture combining a vision encoder, MLP adapter, and LLM

CLIP: Contrastive Language-Image Pre-training—a model used as the vision encoder to map images and text to a shared embedding space