Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, Huazhe Xu
Institute for Interdisciplinary Information Sciences, Tsinghua University, Shanghai Qi Zhi Institute, Shanghai Jiao Tong University, The University of Hong Kong, Peking University, Shanghai Artificial Intelligence Laboratory
Conference on Robot Learning (2024)
RL MM

📝 Paper Summary

Visual Reinforcement Learning, Sim-to-Real Transfer, Generalizable Robotic Manipulation
Maniwhere enables robots to generalize across diverse visual disturbances (viewpoints, appearances, backgrounds) by combining multi-view contrastive learning, spatial transformer networks, and curriculum-based randomization.
Core Problem
Robotic policies trained in simulation often fail in the real world due to visual discrepancies like camera shifts, lighting changes, or background clutter, requiring tedious recalibration.
Why it matters:
  • Policies that assume a fixed camera can become useless once the camera is moved or disturbed in a real-world setup, halting progress.
  • Existing methods typically address a single type of generalization (e.g., only appearance) and fail when multiple disturbances (viewpoint + appearance) occur simultaneously.
  • Naively applying heavy data augmentation to fix this often destabilizes Reinforcement Learning (RL) training, leading to policy divergence.
Concrete Example: A robot arm trained to pick up a mug might fail completely if a lab mate accidentally bumps the camera tripod slightly, changing the viewpoint, or if the table color changes.
Key Novelty
Multi-view Representation Learning with Spatial Transformers (Maniwhere)
  • Trains the visual encoder on images from two cameras (one fixed, one moving) so that a contrastive loss forces it to learn view-invariant features.
  • Integrates a Spatial Transformer Network (STN) module within the encoder to actively transform feature maps, enhancing spatial awareness and robustness to view shifts.
  • Uses a curriculum-based randomization strategy that gradually increases noise levels, so that RL training is not destabilized by heavy randomization early on.
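The multi-view contrastive idea in the first bullet can be illustrated with a small sketch. It assumes an InfoNCE-style loss, which this summary does not confirm as the paper's exact objective. The function names, the temperature value, and the plain-list feature format are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(fixed_feats, moving_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed loss form, not confirmed by the source).

    Row i of fixed_feats and row i of moving_feats are assumed to encode the
    same time step seen from the fixed and the moving camera (positive pair).
    All other rows of moving_feats act as negatives for anchor i.
    """
    fixed = [l2_normalize(v) for v in fixed_feats]
    moving = [l2_normalize(v) for v in moving_feats]
    total = 0.0
    for i, anchor in enumerate(fixed):
        # Cosine similarity to every moving-view feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(anchor, other)) / temperature
                  for other in moving]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_denom - logits[i]  # cross-entropy with the positive at index i
    return total / len(fixed)
```

Under this sketch, the loss is small when each fixed-view feature is closest to the moving-view feature from the same time step and large when the pairs are mismatched, which is the pressure toward view-invariant features that the bullet describes.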
Architecture
Figure 1: The overall framework of Maniwhere. It depicts the data flow from simulation (returning Fixed and Random views), the Visual Encoder with STN, the Multi-View Representation Learning objectives (Contrastive + Alignment), and the RL training loop with Curriculum Randomization.
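The curriculum randomization in the training loop can be sketched as a schedule that scales the randomization strength with the training step. The summary only says that noise levels increase gradually, so the linear ramp, the warm-up phase, and every name and default value below are assumptions made for illustration.

```python
import random

def randomization_scale(step, warmup_steps=10_000, ramp_steps=90_000):
    """Return a factor in [0, 1] that scales randomization strength.

    Assumed schedule: no randomization during warm-up, then a linear ramp
    to full strength over ramp_steps, then constant at 1.0.
    """
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)

def sample_camera_offset(step, max_offset_deg=30.0, rng=random):
    """Sample a camera angle perturbation whose range grows with the curriculum.

    max_offset_deg is a hypothetical bound, not a value from the paper.
    """
    scale = randomization_scale(step)
    return rng.uniform(-max_offset_deg, max_offset_deg) * scale

# Usage: early in training the policy sees an unperturbed view,
# later the full perturbation range becomes available.
early_offset = sample_camera_offset(step=0)
late_offset = sample_camera_offset(step=200_000)
```

The same scale factor could gate other perturbations (textures, lighting, backgrounds), so that all disturbance types strengthen together as training stabilizes.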
Evaluation Highlights
  • Outperforms MV-MWM by 68.6% on average across 8 simulated tasks involving view generalization.
  • Achieves zero-shot sim-to-real transfer on 3 different hardware setups (UR5 arm, Allegro Hand, Leap Hand) without real-world fine-tuning.
  • Maintains high success rates in simulation even when transferred to a different robot embodiment (UR5e to Franka arm).
Breakthrough Assessment
8/10
Strong empirical results demonstrating simultaneous generalization across viewpoints, appearances, and embodiments. The zero-shot sim-to-real transfer on complex dexterous hand tasks is particularly impressive.