Xiang Li, Cristina Mata, Jong Sung Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, R. Burgert, Mu Cai, Yong Jae Lee, M. Ryoo
Stony Brook University,
University of Wisconsin-Madison
International Conference on Learning Representations
(2024)
LLaRA adapts pretrained Vision-Language Models for robot control by converting behavior cloning trajectories into visual conversation data and enhancing them with self-supervised auxiliary tasks like spatial reasoning.
Core Problem
Adapting pretrained Vision-Language Models (VLMs) to robotic control is difficult due to data scarcity and the models' lack of precise spatial awareness required for manipulation.
Why it matters:
Directly transferring VLMs to robotics often fails because standard vision-language data lacks the spatial precision needed for low-level control
Curating high-quality conversation-style data for robotics is non-trivial and hard to scale for new domains compared to standard computer vision tasks
Current approaches often rely on specialized tokens or architectures, limiting the efficient transfer of generalist VLM knowledge to robotic agents
Concrete Example:A standard VLM trained on image captions might describe a mug's color but fail to output the precise pixel coordinates needed for a robot gripper to grasp the handle, as it lacks fine-grained spatial grounding.
Key Novelty
Visuomotor Instruction Tuning with Self-Supervised Auxiliary Data
Converts standard robot behavior cloning data (state-action pairs) into text-based conversations where actions are represented as normalized image coordinates
Generates six types of auxiliary instruction-tuning datasets (e.g., spatial relationships, future prediction) from existing trajectories without requiring new human annotations
Uses a Description-Instruct-BC pipeline to handle multiple-image observations by converting reference images into textual descriptions via object detection
Architecture
The LLaVA model architecture adapted for LLaRA, showing the flow from image/text inputs to language output.
Breakthrough Assessment
7/10
Proposes a clever, scalable data generation pipeline that bridges the gap between general VLMs and specific robot policies without architectural changes, though the core innovation is primarily data-centric.
⚙️ Technical Details
Problem Definition
Setting: Behavioral Cloning (BC) over a Markov Decision Process (MDP)
Inputs: Current visual observation image, textual task description, and history of past actions
Outputs: Predicted future actions represented as text tokens (normalized 2D image coordinates and rotation angles)
Pipeline Flow
Input Processing (Image + Text Instruction)
Visual Encoding (CLIP)
Modal Alignment (MLP Projection)
Language Modeling (LLM Autoregression)
Action Decoding (Text to Robot Commands)
System Modules
Vision Encoder
Encodes the visual observation into feature embeddings
Model or implementation: CLIP-ViT-L-336px (frozen or fine-tuned not explicitly specified in snippet, implied LLaVA standard)
Adapter
Projects visual tokens into the language embedding space
Model or implementation: MLP (Multi-Layer Perceptron)
LLM
Generates the action response based on visual and textual context
Model or implementation: Vicuna-1.5-7B (based on LLaVA-1.5)
Novel Architectural Elements
Description-Instruct-BC Module: An input processing step that uses an off-the-shelf object detector to convert secondary reference images into textual descriptions, allowing a single-image VLM to handle multi-image tasks
Modeling
Base Model: LLaVA-1.5 (Vicuna-1.5-7B LLM + CLIP-ViT-L-336px Vision Encoder)
Training Method: Supervised Fine-Tuning (Instruction Tuning)
Objective Functions:
Purpose: Train the model to predict the next text token (action coordinates) given the image and instruction.
Formally: Standard autoregressive next-token prediction loss.
Adaptation: Fine-tuning of LLM and MLP adapter weights
Code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA. The paper describes templates for data generation in detail (Fig 4 and Appendix).
📊 Experiments & Results
Evaluation Setup
Simulated and real-world robot manipulation tasks evaluating policy success rates
Benchmarks:
Simulated Environments (Robot manipulation (specific benchmark names like LIBERO implied but not explicitly in text))
Real-world Experiments (Robot manipulation)
Metrics:
Success Rate
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Examples of the Instruct-BC and Description-Instruct-BC data formats.
Examples of the six auxiliary self-supervised tasks used to supercharge the dataset.
Main Takeaways
Visuomotor Instruction Tuning effectively aligns pretrained VLMs with robotic control by treating actions as text-based coordinate responses.
Self-supervised auxiliary datasets (localization, future prediction, etc.) generated from existing trajectories significantly improve policy performance without new human labels.
Describing reference images via text (Description-Instruct-BC) allows single-image VLMs to function effectively in tasks requiring multi-image context.
Note: Specific quantitative results (tables/metrics) were not contained in the provided text snippet.
📚 Prerequisite Knowledge
Prerequisites
Understanding of Vision-Language Models (VLMs) like LLaVA
Basics of Robot Learning and Behavior Cloning (BC)
Familiarity with Instruction Tuning
Key Terms
VLA: Vision-Language-Action models—VLMs adapted to generate robotic actions directly
Visuomotor Instruction Tuning: The process of fine-tuning a VLM on visual data where the 'answer' is a specific robot action described in text
Behavior Cloning: A method where a robot learns a policy by mimicking expert demonstrations (state-action pairs)
Self-Supervised Learning: Learning from data without manual labels, here used to generate auxiliary tasks like 'predict the next state' from video trajectories
LLaVA: Large Language and Vision Assistant—a specific VLM architecture combining a vision encoder, MLP adapter, and LLM
CLIP: Contrastive Language-Image Pre-training—a model used as the vision encoder to map images and text to a shared embedding space