
RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, Huazhe Xu
Shanghai Qi Zhi Institute; School of Computer Science, Shanghai Jiao Tong University; Institute for Interdisciplinary Information Sciences, Tsinghua University
arXiv (2025)
RL MM Benchmark

📝 Paper Summary

Real-world Robotic Manipulation, Visuomotor Control
RL-100 unifies imitation and reinforcement learning under a shared PPO objective to fine-tune diffusion policies, reaching a 100% success rate on real robots; consistency distillation then enables high-frequency control.
Core Problem
Supervised imitation learning is capped by the quality of human demonstrations (the imitation ceiling) and can neither correct failure modes nor optimize for speed, while naive real-world RL is sample-inefficient and unsafe.
Why it matters:
  • High-quality real-robot data is scarce and expensive to collect, limiting the scalability of purely supervised methods
  • Teleoperation introduces latency and conservative motion biases, preventing robots from achieving superhuman efficiency
  • Existing sim-to-real methods struggle with visual/dynamics gaps, and direct real-world RL often suffers from catastrophic forgetting or instability
Concrete Example: In the 'Box Folding' task, imitation-only policies reach only 12-48% success because they cannot recover from small misalignments in complex bimanual folding sequences. With RL fine-tuning, RL-100 learns to recover from these errors and achieves 100% success.
Key Novelty
Unified RL-100 Framework (IL → Offline RL → Online RL)
  • Treats the diffusion denoising process as a multi-step decision process, allowing a unified PPO (Proximal Policy Optimization) surrogate objective to fine-tune the policy across both offline and online stages
  • Compresses the expensive multi-step diffusion policy into a one-step Consistency Model (CM) via distillation, enabling high-frequency control (10-20 Hz) suitable for dynamic tasks without retraining
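To make the first point concrete, here is a minimal sketch of a clipped PPO surrogate applied over denoising steps. It is not the authors' implementation. It assumes that each of the K denoising steps is a Gaussian transition whose log-probability can be evaluated under both the current and the behavior policy, and that one environment-level advantage is shared by all K steps. The function names, array shapes, and the `clip_eps` default are hypothetical choices for illustration.

```python
import numpy as np


def gaussian_log_prob(x, mean, std):
    """Log-density of an isotropic Gaussian, summed over the last (action) axis.

    Assumption: each denoising transition is modeled as N(mean, std^2 I).
    """
    var = std ** 2
    per_dim = -0.5 * (x - mean) ** 2 / var - np.log(std) - 0.5 * np.log(2 * np.pi)
    return np.sum(per_dim, axis=-1)


def ppo_denoising_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate averaged over samples and denoising steps.

    logp_new, logp_old: arrays of shape (batch, K), one log-prob per denoising step.
    advantage: array of shape (batch,), shared across the K steps (an assumption).
    Returns a scalar to be maximized.
    """
    ratio = np.exp(logp_new - logp_old)          # per-step importance ratio
    adv = advantage[:, None]                     # broadcast over denoising steps
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return np.mean(np.minimum(unclipped, clipped))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, K, act_dim = 4, 5, 3
    x = rng.normal(size=(batch, K, act_dim))         # denoised samples per step
    mean_old = rng.normal(size=(batch, K, act_dim))  # behavior-policy step means
    mean_new = mean_old + 0.05                       # slightly updated policy
    logp_old = gaussian_log_prob(x, mean_old, 0.5)
    logp_new = gaussian_log_prob(x, mean_new, 0.5)
    adv = rng.normal(size=batch)
    print(ppo_denoising_surrogate(logp_new, logp_old, adv))
```

Because the surrogate depends only on log-probability ratios and advantages, the same function can in principle be fed offline data or online rollouts, which is the sense in which one objective can span both stages. How advantages are estimated in each stage is left outside this sketch.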
Evaluation Highlights
  • 100% success rate across 1000 total evaluations on 8 real-world tasks (including Pouring, Unscrewing, and Folding), improving over the DP3 baseline (67.8% mean)
  • Continuous 7-hour operation of the 'Orange Juicing' robot in a public shopping mall with zero failures, demonstrating robustness in unstructured environments
  • Matches or exceeds human teleoperation efficiency, with the Consistency Model policy completing 'Box Folding' 1.57x faster than the DP-2D imitation baseline
Breakthrough Assessment
9/10
Achieving 100% success on 1000 real-world trials across diverse dynamic/deformable tasks is a massive reliability milestone. The integration of Consistency Models for latency reduction addresses a major bottleneck in diffusion robotics.