Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition

📝 Paper Summary

Robot Learning, Language-Conditioned Manipulation, Synthetic Data Generation

A framework using LLMs to plan and verify robotic exploration data, which is then distilled into a robust language-conditioned diffusion policy for real-world manipulation.

Core Problem

Acquiring robust, reusable manipulation skills typically requires costly human demonstrations or inefficient trial-and-error exploration.

Why it matters:

Human teleoperation and annotation are not scalable for large-scale data collection.
Existing automated exploration methods often lack optimality, generality, or complete robot data labels (vision, action, text).
Reinforcement learning exploration is inefficient for long-horizon, sparse-reward tasks.

Concrete Example: In a 'catapult' task, a standard planar planner might only solve the easiest goal (closest bin) deterministically, failing to explore other bins. Without 6DoF exploration and automatic retries, the robot never generates the diverse success data needed to learn the full task distribution.

Key Novelty

LLM-Guided Data Generation & Diffusion Policy Distillation

Use an LLM not as the final policy, but as a high-level planner that guides sampling-based robot utilities (grasping, motion planning) to generate diverse training data.
The LLM writes its own success-verification code, enabling a 'verify & retry' loop where the robot automatically recovers from failures during data collection.
Distill this messy, autonomously generated data into a multi-task diffusion policy that conditions on language and vision, inheriting robustness without needing expert demonstrations.
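The 'verify & retry' idea above can be sketched as a small loop: sample an attempt, check it with a success function, and try again on failure. This is a minimal illustration under stated assumptions, not the paper's implementation. The names `collect_with_retry`, `pull_drawer`, `drawer_is_open`, the `drawer_opening_m` state key, and the 0.15 m threshold are all hypothetical; in the described framework the success check would be code written by the LLM and evaluated on simulation state.

```python
import random
from typing import Callable, Dict, List, Optional

State = Dict[str, float]


def collect_with_retry(
    attempt: Callable[[State, random.Random], State],
    is_success: Callable[[State], bool],
    initial_state: State,
    max_attempts: int = 5,
    seed: int = 0,
) -> Optional[List[State]]:
    """Repeat a sampled attempt until the success check passes.

    Returns every visited state (failed attempts included, so recoveries
    are part of the record), or None if all attempts fail.
    """
    rng = random.Random(seed)
    state = dict(initial_state)
    trajectory = [dict(state)]
    for _ in range(max_attempts):
        state = attempt(state, rng)
        trajectory.append(dict(state))
        if is_success(state):
            return trajectory
    return None


def pull_drawer(state: State, rng: random.Random) -> State:
    """Toy stand-in for a sampled grasp-and-pull that slips half the time."""
    new_state = dict(state)
    if rng.random() < 0.5:
        new_state["drawer_opening_m"] = 0.20
    return new_state


def drawer_is_open(state: State) -> bool:
    """Hypothetical success condition of the kind an LLM might write."""
    return state["drawer_opening_m"] > 0.15


trajectory = collect_with_retry(
    pull_drawer, drawer_is_open, {"drawer_opening_m": 0.0}, max_attempts=20
)
```

Keeping the failed attempts in the returned trajectory is a deliberate choice in this sketch: it is what lets a policy trained on the data see recovery behavior rather than only clean first-try successes.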

Architecture

The two-stage framework: (1) LLM-guided data generation and (2) Language-conditioned diffusion policy distillation.

Evaluation Highlights

The distilled policy improves success rate by 33.2 percentage points on average across five domains (33.8% to 67.0%) compared with the LLM data-collection policy itself.
Achieves 76% success rate in real-world Sim2Real transfer on a transport task with unseen objects.
The 'Verify & Retry' mechanism in data generation improves collection success rates by up to 13x (in Drawer domain) compared to no retries.

Breakthrough Assessment

8/10

Strong contribution in autonomous data scale-up, effectively bridging the gap between high-level LLM reasoning and low-level control via distillation, with impressive Sim2Real results.

⚙️ Technical Details

Problem Definition

Setting: Multi-task language-conditioned visuomotor control

Inputs: Task description (language), RGB camera views (wrist + global), proprioception

Outputs: Sequence of 6DoF end-effector poses and gripper commands

Pipeline Flow

Visual Encoders (ResNet18) extract features from images
Language Encoder (CLIP) extracts features from text
Feature Fusion (FiLM conditioning)
Diffusion Decoder generates action sequence

System Modules

Visual Encoders (Perception)

Encode wrist and global camera images into feature vectors

Model or implementation: ResNet18 (with spatial softmax)

Language Encoder (Perception)

Encode natural language task description

Model or implementation: CLIP (frozen)

Policy Network

Generate action sequences via conditional denoising

Model or implementation: Diffusion Policy (U-Net based)

Novel Architectural Elements

Integration of FiLM (Feature-wise Linear Modulation) conditioning into the Diffusion Policy architecture for multi-task language capability
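As a rough illustration of FiLM, the sketch below projects a conditioning vector to one scale (gamma) and one shift (beta) per feature channel and applies them to a feature map. It uses plain Python lists so it runs without dependencies. The helper names `linear` and `film`, the toy weights, and the 2-dimensional conditioning vector are assumptions for illustration; how the paper builds the conditioning vector from CLIP and ResNet18 features is not specified here.

```python
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: List[List[float]], gamma: List[float], beta: List[float]) -> List[List[float]]:
    """Apply a per-channel scale and shift.

    features is laid out as [channel][position]; every position in a
    channel shares that channel's gamma and beta.
    """
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy conditioning vector (stands in for a fused language/vision embedding).
condition = [1.0, 0.0]

# Hypothetical projection weights: 2 output channels, 2 input dimensions.
gamma = linear(condition, [[2.0, 0.0], [0.5, 0.0]], [0.0, 0.0])   # [2.0, 0.5]
beta = linear(condition, [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])   # [1.0, -1.0]

features = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 positions each
modulated = film(features, gamma, beta)
print(modulated)  # [[3.0, 5.0], [0.5, 1.0]]
```

Because the same network weights are modulated differently for each instruction, a single policy can serve several tasks without a separate head per task.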

Modeling

Base Model: Diffusion Policy (CNN-based U-Net backbone)

Training Method: Behavior Cloning via Diffusion (Supervised Learning)

Objective Functions:

Purpose: Minimize the difference between the predicted noise and the actual noise added to the demonstrated actions in the training data.

Formally: MSE Loss on noise prediction (standard diffusion loss)
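For reference, the standard noise-prediction objective can be written as below. The notation is an assumption chosen for this summary rather than taken from the paper: $\mathcal{D}$ is the training dataset, $o$ the conditioning observation (images, proprioception, and language), $a^0$ a clean action sequence, $k$ the diffusion step, $\bar{\alpha}_k$ the cumulative noise-schedule coefficient, and $\epsilon_\theta$ the noise-prediction network.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{(o,\,a^0) \sim \mathcal{D},\; k,\; \epsilon \sim \mathcal{N}(0, I)}
    \left[
      \left\lVert
        \epsilon - \epsilon_\theta\!\left(
          \sqrt{\bar{\alpha}_k}\, a^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,\; o,\; k
        \right)
      \right\rVert^2
    \right]
```

The first argument of $\epsilon_\theta$ is the noised action sequence, so the network is trained to recover exactly the noise that was mixed into the clean actions.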

Training Data:

Data generated by LLM-guided exploration in MuJoCo simulation
Filtered for success using LLM-generated success conditions
Contains text labels for root tasks and subtasks
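The training-data properties listed above suggest a record that carries both a root-task label and subtask labels, plus a success filter applied before training. The sketch below is one hypothetical way to express that. The `Episode` class, its fields, the task strings, and the state keys are illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Episode:
    root_task: str                 # top-level language label
    subtasks: List[str]            # language labels for the steps taken
    final_state: Dict[str, float]  # simulation state at the end of the episode


def filter_successes(
    episodes: List[Episode],
    success_conditions: Dict[str, Callable[[Dict[str, float]], bool]],
) -> List[Episode]:
    """Keep only episodes whose root task's success condition holds."""
    return [
        ep for ep in episodes
        if success_conditions[ep.root_task](ep.final_state)
    ]


# Hypothetical success condition, of the kind the LLM would generate.
conditions = {
    "put the block in the bin": lambda s: s["block_to_bin_distance_m"] < 0.05,
}

episodes = [
    Episode("put the block in the bin", ["grasp the block", "move to the bin"],
            {"block_to_bin_distance_m": 0.01}),
    Episode("put the block in the bin", ["grasp the block"],
            {"block_to_bin_distance_m": 0.40}),
]

kept = filter_successes(episodes, conditions)
print(len(kept))  # 1
```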

Key Hyperparameters:

inference_steps: 5 (DDIM)
training_steps: 50
action_horizon: 16
prediction_horizon: 16
observation_horizon: 1 (vision) + history (proprioception)
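To make the gap between 50 training steps and 5 inference steps concrete, the sketch below picks an evenly strided subset of timesteps, which is one common convention for DDIM sampling. It assumes that `training_steps` refers to the number of diffusion timesteps used during training; the function name `ddim_timesteps` is hypothetical, and the spacing actually used in the paper may differ.

```python
from typing import List


def ddim_timesteps(num_train_steps: int, num_inference_steps: int) -> List[int]:
    """Return an evenly strided, descending subset of training timesteps."""
    if not 0 < num_inference_steps <= num_train_steps:
        raise ValueError("need 0 < num_inference_steps <= num_train_steps")
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]


schedule = ddim_timesteps(50, 5)
print(schedule)  # [40, 30, 20, 10, 0]
```

Each action sequence then needs only 5 network evaluations instead of 50, which is what makes a high control rate feasible at inference time.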

Compute: Inference at approximately 35 Hz on an NVIDIA RTX 3080

Comparison to Prior Work

vs. Code-as-Policy: Uses LLM for data generation/planning only, not as the real-time policy; incorporates verify & retry loops
vs. BC-Z: Uses diffusion-based policy head instead of deterministic MLP/GMM; distills from suboptimal autonomous data rather than human experts
vs. Diffusion Policy: Extends to multi-task setting via language conditioning (FiLM)

Limitations

Data generation relies on privileged simulation state for success verification
Requires Sim2Real transfer; data generation limited to simulation assets
Evaluated primarily on root task success, not compositional sub-skill reuse

Reproducibility

Code: https://scalingup-distillingdown.github.io/

Code, data, and policy results are available at the project website. Sim2Real relies on specific hardware (UR5). Data generation requires MuJoCo and Google Scanned Objects.

📊 Experiments & Results

Evaluation Setup

MuJoCo simulation (5 domains) and Real World (UR5 robot)

Benchmarks:

Simulated Benchmark (18 tasks across 5 domains: Balance, Catapult, Transport, Mailbox, Drawer) [New]
Real World Transport (Pick and place with novel objects) [New]

Metrics:

Success Rate (%)
Statistical methodology: Averaged over 200 episodes (Sim) or 10 episodes per object (Real)

Key Results

Distillation performance compares the final learned policy against the data-generation oracle (LLM-as-Policy) and ablated versions.
Benchmark	Metric	Baseline	This Paper	Δ
Average across 5 domains	Success Rate (%)	33.8	67.0	+33.2
Average across 5 domains	Success Rate (%)	32.2	67.0	+34.8
Balance Domain	Success Rate (%)	15.0	79.0	+64.0
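The Δ column is an absolute difference in success rate, so it is measured in percentage points rather than as a relative percent change. The short check below recomputes it from the table values and also shows the corresponding relative factor for comparison. The variable names are illustrative.

```python
# (benchmark, baseline success %, this paper success %), copied from the table
rows = [
    ("Average across 5 domains", 33.8, 67.0),
    ("Average across 5 domains", 32.2, 67.0),
    ("Balance Domain", 15.0, 79.0),
]

# Absolute gain in percentage points (matches the Δ column).
deltas = [round(ours - baseline, 1) for _, baseline, ours in rows]

# Relative factor, i.e. how many times higher the distilled policy scores.
ratios = [round(ours / baseline, 2) for _, baseline, ours in rows]

print(deltas)  # [33.2, 34.8, 64.0]
print(ratios)  # [1.98, 2.08, 5.27]
```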

Experiment Figures

Success rate vs. time for the Balance task, comparing the distilled policy against baselines.

Main Takeaways

Distilled policies significantly outperform the teacher (LLM planner) because they are reactive closed-loop policies, whereas the teacher is open-loop between planning steps.
The 'Verify & Retry' loop in data generation is critical; without it, complex tasks like Mailbox have 0% success in data collection.
Diffusion architecture drastically outperforms feed-forward (MLP) baselines for this type of multi-modal data.
Spatial softmax works better than mean pooling for the visual encoder in this setting.
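One intuition for the last takeaway is that spatial softmax keeps information about where a feature fires, while mean pooling discards it. The dependency-free sketch below computes both for a single feature channel. It is illustrative only: the function names and the [-1, 1] coordinate convention are assumptions, and a real encoder would apply this per channel on ResNet18 feature maps.

```python
import math
from typing import List, Tuple


def spatial_softmax(feature_map: List[List[float]]) -> Tuple[float, float]:
    """Softmax over all pixels, then the expected (x, y) in [-1, 1]."""
    height, width = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # subtract for stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    expected_x = expected_y = 0.0
    for i, row in enumerate(weights):
        y = -1.0 + 2.0 * i / (height - 1) if height > 1 else 0.0
        for j, w in enumerate(row):
            x = -1.0 + 2.0 * j / (width - 1) if width > 1 else 0.0
            expected_x += (w / total) * x
            expected_y += (w / total) * y
    return expected_x, expected_y


def mean_pool(feature_map: List[List[float]]) -> float:
    """Average activation over all pixels."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


top_right = [[0.0, 0.0, 50.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
bottom_left = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [50.0, 0.0, 0.0]]

# Mean pooling cannot tell the two maps apart.
same_mean = mean_pool(top_right) == mean_pool(bottom_left)

# Spatial softmax returns different keypoint locations for them.
keypoint_a = spatial_softmax(top_right)
keypoint_b = spatial_softmax(bottom_left)
```

Here `keypoint_a` lands near (1, -1) and `keypoint_b` near (-1, 1), while the mean-pooled values are identical, which is the kind of positional information a manipulation policy needs.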

📚 Prerequisite Knowledge

Prerequisites

Behavior Cloning (BC)
Diffusion Models for Control
Task and Motion Planning (TAMP)
Large Language Models (LLMs)

Key Terms

6DoF: Six Degrees of Freedom—ability to move in 3D space (x, y, z) and rotate around three axes (roll, pitch, yaw)

Diffusion Policy: A policy representation that generates robot actions by iteratively denoising random noise, conditioned on observations

DDIM: Denoising Diffusion Implicit Models—a sampling method used to speed up the diffusion process during inference

LLM: Large Language Model—AI models like GPT-3 used here for reasoning and code generation

RRT: Rapidly-exploring Random Tree—a sampling-based algorithm for path planning in high-dimensional spaces

TAMP: Task and Motion Planning—combining high-level symbolic planning with low-level geometric motion planning

Sim2Real: Transferring a policy learned in a physics simulation to a physical robot

FiLM: Feature-wise Linear Modulation—a technique to condition neural networks on auxiliary inputs (like language) by scaling and shifting feature maps

Proprioception: The robot's internal sense of its own joint positions and gripper state