RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

📝 Paper Summary

Real-world Robotic Manipulation
Visuomotor Control

RL-100 unifies imitation and reinforcement learning under a shared PPO objective to fine-tune diffusion policies, achieving perfect success rates and high-frequency control via consistency distillation on real robots.

Core Problem

Supervised imitation learning is constrained by the quality of human demonstrations (imitation ceiling) and cannot correct failure modes or optimize for speed, while naive real-world RL is sample-inefficient and unsafe.

Why it matters:

High-quality real-robot data is scarce and expensive to collect, limiting the scalability of purely supervised methods
Teleoperation introduces latency and conservative motion biases, preventing robots from achieving super-human efficiency
Existing sim-to-real methods struggle with visual/dynamics gaps, and direct real-world RL often suffers from catastrophic forgetting or instability

Concrete Example: In the 'Box Folding' task, imitation-only policies fail (12-48% success) because they cannot recover from small misalignments in complex bimanual folding sequences. RL-100 recovers from these errors via RL fine-tuning, achieving 100% success.

Key Novelty

Unified RL-100 Framework (IL → Offline RL → Online RL)

Treats the diffusion denoising process as a multi-step decision process, allowing a unified PPO (Proximal Policy Optimization) surrogate objective to fine-tune the policy across both offline and online stages
Compresses the expensive multi-step diffusion policy into a one-step Consistency Model (CM) via distillation, enabling high-frequency control (10-20Hz) suitable for dynamic tasks without retraining
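The idea of treating each denoising step as a decision can be illustrated with a generic PPO clipped surrogate averaged over the denoising chain. This is a minimal sketch, not the paper's implementation: the function names, the scalar Gaussian transitions, and the choice to share one environment-level advantage across all K steps are assumptions made for illustration.

```python
import math

def gaussian_logp(x, mean, std):
    """Log-density of a scalar Gaussian N(mean, std^2) evaluated at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def ppo_clip_over_denoising(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate averaged over the K denoising steps of one action.

    logp_new / logp_old: per-step log-probabilities of the sampled denoising
    transitions under the current policy and the data-collecting policy.
    advantage: a single advantage shared by all K steps (assumption).
    """
    total = 0.0
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * advantage, clipped * advantage)
    return total / len(logp_new)  # maximize this; the loss is its negative

# Toy example: three denoising steps with hypothetical predicted means.
samples = [0.9, 0.4, 0.1]      # values actually sampled at each step
old_means = [1.0, 0.5, 0.0]    # means under the behavior policy
new_means = [0.9, 0.4, 0.1]    # means under the updated policy
std = 0.5
logp_old = [gaussian_logp(x, m, std) for x, m in zip(samples, old_means)]
logp_new = [gaussian_logp(x, m, std) for x, m in zip(samples, new_means)]
surrogate = ppo_clip_over_denoising(logp_new, logp_old, advantage=1.0)
```

Because every denoising transition has a tractable Gaussian likelihood, the same ratio-and-clip machinery applies whether the transitions come from logged data or fresh rollouts, which is what lets one objective serve both stages.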

Evaluation Highlights

100% success rate across 1000 total evaluations on 8 real-world tasks (including Pouring, Unscrewing, and Folding), improving over the DP3 baseline (67.8% mean)
Continuous 7-hour operation of the 'Orange Juicing' robot in a public shopping mall with zero failures, demonstrating robustness in unstructured environments
Matches or exceeds human teleoperation efficiency, with the Consistency Model policy completing 'Box Folding' 1.57x faster than the DP-2D imitation baseline

Breakthrough Assessment

9/10

Achieving 100% success on 1000 real-world trials across diverse dynamic/deformable tasks is a massive reliability milestone. The integration of Consistency Models for latency reduction addresses a major bottleneck in diffusion robotics.

⚙️ Technical Details

Problem Definition

Setting: Real-world visuomotor control modeled as a Markov Decision Process (MDP) with continuous action spaces

Inputs: Visual observations o_t (Point clouds or RGB images) and proprioception q_t

Outputs: Action a_t (End-effector pose or joint positions)

Pipeline Flow

Visual Encoder (Processes Point Cloud/RGB)
Conditioning Fusion (Combines Vision + Proprioception)
Policy Network (Generates Action)
Execution (Robot Controller)

System Modules

Visual Encoder

Extract features from visual inputs (Point Cloud or RGB)

Model or implementation: PointNet (for 3D) or ResNet-like (for 2D) [Implied context]

RL-100 Actor (Diffusion/Consistency)

Generate action sequence conditioned on observations

Model or implementation: Conditional Diffusion Model (during training) / Consistency Model (during deployment)

Novel Architectural Elements

Deployment-time switch from Multi-step Diffusion Actor to One-step Consistency Policy via distillation
Self-supervised visual encoder specifically tailored for RL post-training to prevent representation drift
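The switch from a multi-step actor to a one-step policy can be pictured with a toy distillation loop. This sketch only matches a one-step student to the output of a multi-step teacher, as the summary describes; a full consistency-distillation objective also enforces agreement between neighboring points on the sampling trajectory, which is omitted here. The teacher, the single-parameter student, and all names are hypothetical stand-ins.

```python
import random

def teacher_multistep(x, cond, steps=5):
    """Stand-in for a K-step diffusion sampler: each step moves x halfway to cond."""
    for _ in range(steps):
        x = x + 0.5 * (cond - x)
    return x

def student_onestep(x, cond, w):
    """One-step student with a single learnable scalar w (hypothetical)."""
    return x + w * (cond - x)

def distill(steps=5, iters=500, lr=0.05, seed=0):
    """Fit w by SGD on the squared student-teacher distance."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(iters):
        x, cond = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        target = teacher_multistep(x, cond, steps)  # teacher output, treated as fixed
        err = student_onestep(x, cond, w) - target
        grad = 2.0 * err * (cond - x)               # d(err^2)/dw
        w -= lr * grad
    return w

w_star = distill()
# The 5-step teacher closes a fraction 1 - 0.5**5 of the gap, so w_star
# should approach that value: one student call replaces five teacher calls.
```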

Modeling

Base Model: Diffusion Policy (U-Net or Transformer backbone)

Training Method: Three-stage pipeline: Imitation Learning → Iterative Offline RL → Online RL

Objective Functions:

Purpose: Clone human behavior to initialize the policy.
Formally: MSE loss on noise prediction, L_IL = E[||epsilon - epsilon_theta(a_t, t, c)||^2].

Purpose: Reinforce successful behaviors using a clipped surrogate objective.
Formally: PPO-style objective L_RL applied to the diffusion denoising steps.

Purpose: Compress diffusion policy for fast inference.
Formally: Consistency Distillation Loss L_CD minimizing distance between multi-step teacher and one-step student outputs.
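The noise-prediction loss L_IL can be made concrete with a small Monte Carlo estimate. This is a sketch under stated assumptions: actions are scalars, the diffusion step is written k (to keep it apart from the control time step), the three-entry `alpha_bars` schedule is invented for the example, and the two "models" are hand-written functions rather than trained networks.

```python
import math
import random

def noise_prediction_loss(eps_model, actions, conds, alpha_bars, rng):
    """Monte Carlo estimate of E[(eps - eps_theta(a_k, k, c))^2] for scalar actions."""
    total = 0.0
    for a0, c in zip(actions, conds):
        k = rng.randrange(len(alpha_bars))   # random diffusion step
        eps = rng.gauss(0.0, 1.0)            # noise the model must recover
        # Forward noising of the clean demonstration action a0.
        a_k = math.sqrt(alpha_bars[k]) * a0 + math.sqrt(1.0 - alpha_bars[k]) * eps
        total += (eps - eps_model(a_k, k, c)) ** 2
    return total / len(actions)

alpha_bars = [0.9, 0.5, 0.1]            # toy schedule (assumption)
conds = [0.2, -0.4, 0.7, 0.0] * 50      # toy conditioning values
actions = list(conds)                   # toy demos: the expert action equals c

def oracle(a_k, k, c):
    """Knows that a0 == c in this toy data, so it can invert the noising exactly."""
    return (a_k - math.sqrt(alpha_bars[k]) * c) / math.sqrt(1.0 - alpha_bars[k])

def zero_model(a_k, k, c):
    """Untrained baseline that always predicts zero noise."""
    return 0.0

oracle_loss = noise_prediction_loss(oracle, actions, conds, alpha_bars, random.Random(0))
zero_loss = noise_prediction_loss(zero_model, actions, conds, alpha_bars, random.Random(0))
```

The oracle drives the loss to numerical zero, while the zero predictor is left with roughly the variance of the injected noise; training a network on L_IL moves it from the second regime toward the first.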

Training Data:

Human Demonstrations: ~115 episodes per task (1.8 hours)
Offline Rollouts: ~566 episodes per task (6.5 hours)
Online Rollouts: ~434 episodes per task (5.6 hours)

Key Hyperparameters:

observation_horizon: 2 frames
action_chunk_size: 8-16 steps
point_cloud_size: 1024 points
image_resolution: 128x128 (for RGB baseline)
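The effect of action_chunk_size on how often the policy is queried can be sketched with a simple execution loop. Whether RL-100 executes a full chunk before re-planning is not stated in this summary, so that behavior, the scalar toy environment, and every name below are assumptions.

```python
def run_with_chunks(policy, env_step, obs, horizon, chunk_size):
    """Execute `horizon` control steps, querying the policy once per chunk."""
    policy_calls = 0
    t = 0
    while t < horizon:
        chunk = policy(obs, chunk_size)       # list of chunk_size actions
        policy_calls += 1
        for action in chunk[: horizon - t]:   # do not overrun the horizon
            obs = env_step(obs, action)
            t += 1
    return obs, policy_calls

# Toy 1-D task: move a scalar position to the goal at 1.0.
def toy_policy(obs, k):
    """Splits the remaining distance to the goal evenly over k actions."""
    return [(1.0 - obs) / k] * k

def toy_env_step(obs, action):
    return obs + action

final_8, calls_8 = run_with_chunks(toy_policy, toy_env_step, 0.0, horizon=32, chunk_size=8)
final_16, calls_16 = run_with_chunks(toy_policy, toy_env_step, 0.0, horizon=32, chunk_size=16)
# A 32-step episode needs 4 policy calls with chunks of 8 and 2 with chunks of 16.
```

Larger chunks amortize each expensive policy call over more control steps, at the cost of reacting to new observations less often.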

Compute: Not reported in the paper

Comparison to Prior Work

vs. DP3: RL-100 adds RL post-training stages (Offline+Online) to imitation, achieving 100% vs 67.8% success
vs. SERL: RL-100 handles complex reorientation and long-horizon tasks (like Box Folding) without limiting wrist rotations, whereas SERL often constrains action spaces
vs. HIL-SERL: RL-100 focuses on autonomous refinement after imitation rather than continuous human-in-the-loop corrections during training

Limitations

Relies on human-defined sparse rewards or shaped rewards (like in Push-T), which requires instrumentation
Online RL stage requires physical resets, which can be labor-intensive without automated reset mechanisms
Evaluation is limited to manipulation of rigid objects, deformable objects, and fluids; it does not cover navigation or mobile manipulation

Reproducibility

Code: https://lei-kun.github.io/RL-100/

The link above is the project website. The paper details task setups, reward functions (e.g., Push-T shaped reward), and data collection budgets.

📊 Experiments & Results

Evaluation Setup

Real-world robotic evaluation on 8 tasks involving rigid bodies, deformable objects (towels, boxes), and fluids

Benchmarks:

Real-Robot Suite (Robotic Manipulation) [New]

Metrics:

Success Rate (%)
Time-to-Completion (s)
Environment Steps
Statistical methodology: Not explicitly reported in the paper

Key Results

RL-100 consistently outperforms imitation baselines (DP-2D, DP3) across all tasks, achieving perfect success rates after the online RL phase.

Benchmark	Metric	Baseline	This Paper	Δ
Box Folding	Success Rate (%)	48	100	+52
Pouring	Success Rate (%)	48	100	+52
Dynamic Unscrewing	Success Rate (%)	82	100	+18

Deployment efficiency metrics show RL-100 completes tasks significantly faster than baselines, largely due to Consistency Model (CM) distillation.

Benchmark	Metric	Baseline	This Paper	Δ
Box Folding	Wall-clock time (s)	65.1	41.4	-23.7
Dynamic Push-T	Episodes per unit time	17	20	+3
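As a quick arithmetic check, the efficiency rows can be turned into ratios. The numbers are taken from the rows above; the variable names are only for illustration.

```python
dp2d_time_s = 65.1    # Box Folding wall-clock time, baseline column
rl100_time_s = 41.4   # Box Folding wall-clock time, RL-100 column

speedup = dp2d_time_s / rl100_time_s        # ratio of completion times
time_saved_s = dp2d_time_s - rl100_time_s   # absolute saving per episode
throughput_gain = 20 / 17                   # Dynamic Push-T episodes per unit time

print(f"speedup: {speedup:.2f}x, saved: {time_saved_s:.1f} s, "
      f"Push-T throughput: {throughput_gain:.2f}x")
# prints: speedup: 1.57x, saved: 23.7 s, Push-T throughput: 1.18x
```

The 1.57x figure quoted for Box Folding is therefore the ratio of the two wall-clock times, not a separate measurement.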

Main Takeaways

RL-100 achieves 100% success across 8 diverse real-world tasks, validating the effectiveness of the unified IL-to-RL pipeline
The iterative offline RL stage provides the bulk of performance improvement (e.g., 48% -> 96% on Box Folding), with online RL closing the final gap to 100%
Consistency Model distillation reduces inference to a single denoising step; on Box Folding the distilled policy completes the task 1.57x faster in wall-clock time (65.1 s vs. 41.4 s), enabling high-frequency control without sacrificing success rates
Zero-shot robustness is high (90% average) against environmental shifts like lighting, friction, and visual distractors, suggesting the policy learns generalizable features

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, Actor-Critic)
Generative Models (Diffusion, Consistency Models)
Robotic Manipulation (SE(3) action spaces, Proprioception)

Key Terms

PPO: Proximal Policy Optimization—an RL algorithm that improves policies using a clipped surrogate objective to prevent destructively large updates

Diffusion Policy: A robot control policy that generates actions by iteratively denoising random noise, conditioned on observations

Consistency Model: A generative model distilled from diffusion that can generate samples (actions) in a single step, drastically reducing inference latency

DDIM: Denoising Diffusion Implicit Models—a sampling method for diffusion models that skips steps to speed up generation (but is still slower than Consistency Models)

Action Chunking: Predicting a sequence of k future actions at once rather than just the next immediate action, used to ensure temporal smoothness

OPE: Offline Policy Evaluation—methods to estimate the performance of a policy using historical data without running it on the real robot

Sim-to-real: Transferring policies trained in simulation to the real world; this paper focuses on Real-to-Real (training directly on hardware)

MDP: Markov Decision Process—the mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker
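To make the DDIM entry above concrete, the following sketch runs deterministic DDIM sampling (eta = 0) over an evenly strided subset of timesteps. It is a toy under stated assumptions: samples are scalars, the linear beta schedule and its endpoints are illustrative, and the noise predictor is a hand-written oracle rather than a trained network.

```python
import math

def alpha_bar_schedule(num_train_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for a linear beta schedule (assumed)."""
    bars, prod = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def ddim_sample(x_T, eps_model, alpha_bars, num_sample_steps):
    """Deterministic DDIM; assumes len(alpha_bars) is divisible by num_sample_steps."""
    T = len(alpha_bars)
    stride = T // num_sample_steps
    timesteps = list(range(T - 1, -1, -stride))   # e.g. 99, 89, ..., 9
    x = x_T
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # After the last visited timestep, jump to the clean sample (alpha_bar = 1).
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)
        x0_pred = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
        x = math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps
    return x, len(timesteps)

alpha_bars = alpha_bar_schedule()
x0_true, eps_true = 0.3, -1.2   # toy clean sample and noise draw

def oracle_eps(x, t):
    """Perfect noise predictor for the single toy sample x0_true."""
    return (x - math.sqrt(alpha_bars[t]) * x0_true) / math.sqrt(1.0 - alpha_bars[t])

x_T = math.sqrt(alpha_bars[-1]) * x0_true + math.sqrt(1.0 - alpha_bars[-1]) * eps_true
x_fast, calls_fast = ddim_sample(x_T, oracle_eps, alpha_bars, num_sample_steps=10)
x_full, calls_full = ddim_sample(x_T, oracle_eps, alpha_bars, num_sample_steps=100)
```

With a perfect predictor both runs recover the clean sample, but the strided run needs 10 model calls instead of 100. A Consistency Model pushes this further, to a single call.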