From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation

📝 Paper Summary

Robotic Manipulation Generative Policy Learning

This paper accelerates high-fidelity robotic manipulation policies by distilling a slow multi-step flow-matching teacher into a fast single-step student using a set-level Implicit Maximum Likelihood Estimation objective.

Core Problem

Generative policies like diffusion and flow matching capture multi-modal behaviors well but are too slow for high-frequency control due to iterative solving. Fast one-step alternatives often suffer from mode collapse, averaging diverse possibilities into invalid actions.

Why it matters:

Real-world robots require high-frequency control (>100 Hz) to react to dynamic disturbances, while current generative policies often run at only 2-10 Hz
Standard distillation methods (like KL divergence or MSE) average out distinct modes, causing robots to fail tasks where multiple distinct valid trajectories exist (e.g., grasping an object from the left vs. right)

Concrete Example: In a task with two valid paths to an object (left or right), a standard behavior cloning or naive one-step student might output the average trajectory—going straight through an obstacle—resulting in collision.
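
The averaging failure can be reproduced with a toy calculation. The sketch below assumes two hypothetical 2-D demonstration paths (the waypoints are made up for illustration and do not come from the paper) that pass an obstacle centred at x = 0 on opposite sides. The single prediction that minimises squared error to both demonstrations is their pointwise mean, and that mean runs straight through the obstacle.

```python
# Toy illustration with hypothetical waypoints (not from the paper).
# Two valid demonstrations pass an obstacle centred at x = 0 on opposite sides.
left_path = [(0.0, 0.0), (-1.0, 1.0), (-1.0, 2.0), (0.0, 3.0)]
right_path = [(0.0, 0.0), (1.0, 1.0), (1.0, 2.0), (0.0, 3.0)]


def pointwise_mean(a, b):
    """The single trajectory that minimises summed squared error to both demos."""
    return [((ax + bx) / 2, (ay + by) / 2) for (ax, ay), (bx, by) in zip(a, b)]


def min_clearance(path, obstacle_x=0.0, y_range=(1.0, 2.0)):
    """Smallest |x - obstacle_x| over waypoints whose y lies in the obstacle band."""
    lo, hi = y_range
    return min(abs(x - obstacle_x) for x, y in path if lo <= y <= hi)


mean_path = pointwise_mean(left_path, right_path)
print(mean_path)                  # every x collapses to 0.0
print(min_clearance(left_path))   # 1.0: the left demo clears the obstacle
print(min_clearance(mean_path))   # 0.0: the averaged path passes through it
```

Each demonstration keeps a clearance of 1.0, while the averaged path has none, which is the collision case described above.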

Key Novelty

Set-Level IMLE Distillation for Flow Matching

Treats the slow teacher as an offline oracle that generates sets of valid future trajectories for a given observation
Trains a student to generate a corresponding set of hypotheses in one step, optimized via a bi-directional Chamfer distance
This set-based loss forces the student to cover all modes (diversity) and hit only valid modes (fidelity) without averaging them

Evaluation Highlights

Achieves 123.5 Hz inference speed on RLBench, a 14.3x speedup over the 50-step teacher (8.6 Hz)
Attains 68.6% success rate on RLBench with single-step inference, vastly outperforming Consistency Policy (16.3%) and Diffusion Policy 1-step (1.8%)
Real-world deployment achieves 70.0% success at 125 Hz, enabling dynamic re-planning where the 2.9 Hz teacher fails
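
To put the quoted rates in perspective, the short calculation below converts them into per-call latency and speedup ratios. It uses only the figures listed above; the helper name `latency_ms` is introduced here for illustration.

```python
# Rates quoted in the highlights above.
student_hz_sim, teacher_hz_sim = 123.5, 8.6
student_hz_real, teacher_hz_real = 125.0, 2.9


def latency_ms(rate_hz):
    """Time per policy call, in milliseconds, at a given inference rate."""
    return 1000.0 / rate_hz


print(round(student_hz_sim / teacher_hz_sim, 2))    # 14.36 (simulation speedup)
print(round(latency_ms(teacher_hz_sim), 1))         # 116.3 ms per teacher call
print(round(latency_ms(student_hz_sim), 1))         # 8.1 ms per student call
print(round(student_hz_real / teacher_hz_real, 1))  # 43.1 (real-world speedup)
```

The simulation ratio of about 14.36 agrees with the quoted 14.3x up to rounding, and the real-world gap works out to roughly 43x.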

Breakthrough Assessment

8/10

Substantially narrows the gap between the task performance of generative policies and the speed requirements of real-time control, while largely avoiding mode collapse in one-step distillation.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal trajectory generation for robotic manipulation under dynamic constraints

Inputs: Observation o_t = {RGB images, Depth maps, Point Cloud P, Proprioception S}

Outputs: Set of future action trajectories {τ_k} (horizon H)

Pipeline Flow

Perception Group: RGB+Depth+PCD → Multi-Modal Embedding
Policy Group: Noise + Embedding → Trajectory Generator (Student)

System Modules

Visual Encoder (Perception Group)

Extract features from RGB and Depth images

Model or implementation: Dual ResNet-18 backbones with shared latent projection

Fusion Module (Perception Group)

Fuse heterogeneous sensor inputs into a unified representation

Model or implementation: Symmetric Cross-Attention + Gating Network + PointNet (for PCD) + MLP (for Proprioception)

Student Policy

Generate action trajectory in a single step

Model or implementation: Temporal 1D U-Net (modified to remove time-conditioning)

Novel Architectural Elements

Removal of time-conditioning modules (sinusoidal encoding, FiLM) from the standard Diffusion/Flow U-Net for the student to enable instantaneous mapping
Symmetric cross-attention mechanism explicitly aligning RGB semantic features with Depth geometric features before fusion
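
As a rough illustration of what removing time-conditioning means, the sketch below contrasts a FiLM-modulated block with an unconditioned one. It is a minimal stand-in, not the paper's network: the convolutional layers are omitted, the sinusoidal encoding is one common form, and `to_gamma` / `to_beta` are hypothetical placeholders for learned projections.

```python
import math


def sinusoidal_embedding(t, dim=4):
    """One common sinusoidal encoding of a scalar time t into `dim` features."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def film(features, gamma, beta):
    """FiLM: per-feature affine modulation, y_i = gamma_i * x_i + beta_i."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def teacher_block(features, t, to_gamma, to_beta):
    """Time-conditioned block: gamma and beta are predicted from the time embedding."""
    emb = sinusoidal_embedding(t, dim=len(features))
    return film(features, to_gamma(emb), to_beta(emb))


def student_block(features):
    """One-step student: there is no time input, so the modulation path is dropped."""
    return list(features)


# Hypothetical projections standing in for learned layers.
to_gamma = lambda emb: [1.0 + e for e in emb]
to_beta = lambda emb: list(emb)

x = [1.0, 2.0, 3.0, 4.0]
print(teacher_block(x, 0.0, to_gamma, to_beta))  # [1.0, 2.0, 7.0, 9.0]
print(student_block(x))                          # [1.0, 2.0, 3.0, 4.0]
```

The teacher's output depends on the integration time `t`, which it needs at every solver step; the student maps its input to an output in a single call and therefore has no use for that signal.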

Modeling

Base Model: Temporal 1D U-Net (Teacher and Student)

Training Method: Distillation via Implicit Maximum Likelihood Estimation (IMLE)

Objective Functions:

Purpose: Enforce mode coverage (Teacher -> Student).

Formally: L_cover = sum_i min_j ||τ_teacher^i - τ_student^j||^2

Purpose: Enforce mode seeking (Student -> Teacher) / Fidelity.

Formally: L_seek = sum_j min_i ||τ_teacher^i - τ_student^j||^2

Purpose: Combined distillation loss.

Formally: L = L_cover + L_seek (Symmetric Chamfer Distance)
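
The two terms can be made concrete with a small pure-Python sketch. It assumes each trajectory is flattened into a list of floats and uses hypothetical 1-D toy sets; it illustrates the formulas above and is not the paper's implementation (which would operate on batched tensors).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two flattened trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_loss(teacher_set, student_set):
    """Return (L_cover, L_seek, L) for two sets of flattened trajectories."""
    # Coverage: every teacher sample must have a nearby student hypothesis.
    l_cover = sum(min(sq_dist(t, s) for s in student_set) for t in teacher_set)
    # Seeking: every student hypothesis must lie near some teacher sample.
    l_seek = sum(min(sq_dist(t, s) for t in teacher_set) for s in student_set)
    return l_cover, l_seek, l_cover + l_seek


# Hypothetical 1-D "trajectories": the teacher has two modes, at -1 and +1.
teacher = [[-1.0], [1.0]]
collapsed = [[0.0], [0.0]]   # student averages the two modes
one_mode = [[1.0], [1.0]]    # student drops the left mode
matched = [[-1.0], [1.0]]    # student covers both modes

print(chamfer_loss(teacher, collapsed))  # (2.0, 2.0, 4.0)
print(chamfer_loss(teacher, one_mode))   # (4.0, 0.0, 4.0)
print(chamfer_loss(teacher, matched))    # (0.0, 0.0, 0.0)
```

In the `one_mode` case the seeking term alone is zero even though a mode is missing; only the coverage term penalises the dropped mode. Both terms are therefore needed, and only the student that covers both modes reaches zero loss.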

Training Data:

Teacher trained on 100 expert demos per task (RLBench)
Student trained on a dataset of K=16 trajectories per observation, generated by the frozen Teacher
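
The offline data-generation step can be sketched as follows. The sketch assumes a toy, hand-written velocity field in place of the trained teacher (it ignores the observation and simply flows each coordinate toward the nearer of two modes), a short horizon of 4 instead of 32, and hypothetical names such as `build_distillation_set`. Only K = 16 and the 50 Euler steps are taken from the summary.

```python
import random

K = 16              # trajectories per observation (from the summary)
TEACHER_STEPS = 50  # ODE solver steps for the teacher (from the summary)


def toy_velocity(x, t):
    """Hypothetical stand-in for the learned velocity field: flows each
    coordinate toward the nearer of two modes at -1 and +1."""
    return [((1.0 if xi >= 0 else -1.0) - xi) / (1.0 - t) for xi in x]


def teacher_sample(rng, horizon=4, steps=TEACHER_STEPS):
    """Integrate dx/dt = v(x, t) from Gaussian noise at t = 0 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    dt = 1.0 / steps
    for n in range(steps):
        t = n / steps  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def build_distillation_set(observations, k=K, seed=0):
    """Map each observation to k teacher trajectories, generated once offline."""
    rng = random.Random(seed)
    return {obs: [teacher_sample(rng) for _ in range(k)] for obs in observations}


dataset = build_distillation_set(["obs_0", "obs_1"])
print(len(dataset["obs_0"]))  # 16
```

Each teacher sample costs 50 velocity evaluations, which is why the sets are generated once offline and then reused as fixed targets for the student.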

Key Hyperparameters:

teacher_epochs: 2000
student_epochs: 1500
teacher_steps: 50
student_steps: 1
generated_set_size_K: 16
trajectory_horizon_H: 32
inference_frequency_sim: 123.5 Hz

Compute: Inference speed: 125 Hz (Real-world), 123.5 Hz (Simulation)

Comparison to Prior Work

vs. Diffusion Policy: Single-step inference (125 Hz) vs. multi-step iterative sampling (8-10 Hz)
vs. Consistency Policy: Uses set-level IMLE objective to preserve modes vs. consistency constraints which often collapse modes
vs. Naive One-Step Truncation: Explicitly trains for one-step distribution matching vs. simply cutting short the ODE solver (which leads to severe performance drops)
vs. Defusion [not cited in paper]: Defusion also accelerates diffusion but typically via distinct fast-solver techniques rather than set-based IMLE distillation

Limitations

Reliance on the teacher's quality; the student cannot outperform the teacher's distribution coverage
Fixed horizon prediction (H=32) may limit applicability to very long-horizon tasks without re-planning
Requires generating a large offline synthetic dataset from the teacher before student training

Reproducibility

Code: https://sites.google.com/view/flow2one

Code is publicly available at https://sites.google.com/view/flow2one. The paper clearly specifies the architecture (ResNet-18, PointNet, U-Net) and key hyperparameters (epochs, horizon).

📊 Experiments & Results

Evaluation Setup

Robotic manipulation tasks in simulation (RLBench) and real-world execution

Benchmarks:

RLBench (robotic manipulation; 8 tasks including Reach Target, Push Button, etc.)
Real-World Manipulation (Dynamic object interaction) [New]

Metrics:

Success Rate (SR)
Inference Speed (Hz)
Statistical methodology: Averaged over 3 independent runs, each with 300 evaluation episodes

Key Results

Simulation results on RLBench showing the student maintains high performance while significantly increasing speed compared to the teacher and baselines.

Benchmark	Metric	Baseline	This Paper	Δ
RLBench	Success Rate (%)	16.3 (Consistency Policy)	68.6	+52.3
RLBench	Success Rate (%)	74.1	68.6	-5.5
RLBench	Success Rate (%)	1.8 (Diffusion Policy, 1-step)	68.6	+66.8
RLBench	Inference Speed (Hz)	8.6 (50-step teacher)	123.5	+114.9

Real-world experiments demonstrating robustness and speed in physical deployment.

Benchmark	Metric	Baseline	This Paper	Δ
Real Robot Tasks	Success Rate (%)	Not reported in the paper	70.0	Not reported in the paper
Real Robot Tasks	Inference Speed (Hz)	2.9 (teacher)	125.0	+122.1

Experiment Figures

Qualitative visualization of successful task executions across 8 RLBench tasks.

Main Takeaways

Single-step distillation via IMLE preserves multi-modal action distributions where consistency/MSE objectives fail (evidenced by 68.6% vs 16.3% SR)
High-frequency inference (125 Hz) enables robust handling of dynamic disturbances that cause slower teachers (2.9 Hz) to fail
The set-level Chamfer loss effectively balances mode coverage and fidelity, preventing the student from averaging conflicting trajectories

📚 Prerequisite Knowledge

Prerequisites

Flow Matching / Diffusion Models
Implicit Maximum Likelihood Estimation (IMLE)
Behavior Cloning
Robotic Perception (PointNet, ResNet)

Key Terms

CFM: Conditional Flow Matching—a generative modeling technique that learns a velocity field to transform noise into data distributions via ODE integration

IMLE: Implicit Maximum Likelihood Estimation—a method for training generative models that matches generated samples to data samples without explicit likelihood evaluation, often preventing mode collapse

Chamfer distance: A metric measuring the similarity between two point sets by summing the distances from each point in one set to its nearest neighbor in the other

ODE integration: The process of solving Ordinary Differential Equations step-by-step to generate samples in diffusion/flow models, which is computationally expensive

Mode collapse: A failure case where a generative model produces only a limited variety of samples or averages diverse modes into a single, often invalid, mean

FiLM: Feature-wise Linear Modulation—a technique to condition neural networks by applying affine transformations to feature maps

Proprioception: The robot's internal sense of its own joint positions and velocities