
GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

Pu Hua, Minghuan Liu, Annabella Macaluso, Yunfeng Lin, Weinan Zhang, Huazhe Xu, Lirui Wang
Institute for Interdisciplinary Information Sciences, Tsinghua University; University of California San Diego; Shanghai Jiao Tong University; Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
Conference on Robot Learning (2024)
MM Agent RL Benchmark

📝 Paper Summary

Synthetic Data Generation · Sim-to-Real Transfer · Robot Manipulation
GenSim2 autonomously generates diverse articulated robotic tasks and demonstrations by leveraging multi-modal and reasoning LLMs to design solvers, enabling robust sim-to-real transfer via a point-cloud policy.
Core Problem
Scaling robotic simulation is bottlenecked by the human effort required to design complex articulated tasks and valid solvers, while existing sim-to-real methods often fail to generalize across diverse tasks.
Why it matters:
  • Real-world robot data collection is expensive and unscalable compared to simulation
  • Manual creation of simulation assets and task logic limits the diversity needed for generalizable policies
  • Existing generative simulation methods (like RoboGen) struggle with the complexity of articulated objects and precise contact-rich motions
Concrete Example: In a task like 'opening a box,' a text-only LLM might generate code that misses the box lid's specific geometry or joint limits. GenSim2 uses a multi-modal LLM (GPT-4V) to inspect the rendered scene, identify keypoints, and generate precise motion constraints for a solver.
Key Novelty
Visual-Feedback Solver Generation & Reasoning-Enhanced Task Proposal
  • Uses Multi-modal LLMs (GPT-4V) to iteratively generate and verify constraints for a keypoint-based motion planner (kPAM) by 'seeing' the simulation assets
  • Leverages Reasoning LLMs (OpenAI o1) to decompose long-horizon tasks into solvable sub-tasks with higher logical consistency than vanilla LLMs
  • Distills generated data into a Proprioceptive Point-cloud Transformer (PPT) policy designed specifically to bridge the sim-to-real gap using geometry
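The generate-and-verify idea in the first bullet can be sketched as a small feedback loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `KeypointConstraint`, `generate_and_verify`, `toy_propose`, and `toy_verify` are hypothetical names, the proposer stands in for a multi-modal LLM call, and the verifier stands in for running a solver in simulation. Real kPAM constraints are considerably richer than a single named keypoint with a tolerance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class KeypointConstraint:
    """Hypothetical constraint: bring the gripper near a named object keypoint."""
    object_keypoint: str
    tolerance_m: float


def generate_and_verify(
    propose: Callable[[Optional[str]], KeypointConstraint],
    verify: Callable[[KeypointConstraint], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,
) -> Tuple[Optional[KeypointConstraint], int]:
    """Ask for a candidate, check it, and feed the failure message back."""
    feedback: Optional[str] = None
    for round_index in range(1, max_rounds + 1):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, round_index
    return None, max_rounds


# Toy scene: keypoints that a rendered view of the asset would expose (assumed values).
SCENE_KEYPOINTS: Dict[str, Point] = {"lid_handle": (0.10, 0.00, 0.30)}


def toy_propose(feedback: Optional[str]) -> KeypointConstraint:
    """Stand-in for the LLM: the first guess is wrong, the second uses the feedback."""
    if feedback is None:
        return KeypointConstraint("box_body", 0.01)
    return KeypointConstraint("lid_handle", 0.01)


def toy_verify(constraint: KeypointConstraint) -> Tuple[bool, Optional[str]]:
    """Stand-in for a simulation check: the keypoint must exist, tolerance must be positive."""
    if constraint.object_keypoint not in SCENE_KEYPOINTS:
        return False, f"unknown keypoint '{constraint.object_keypoint}'"
    if constraint.tolerance_m <= 0:
        return False, "tolerance must be positive"
    return True, None


constraint, rounds = generate_and_verify(toy_propose, toy_verify)
print(constraint, rounds)
```

In this toy run the first proposal names a keypoint that is not in the scene, the verifier reports that, and the second proposal is accepted. The loop gives up and returns `None` once `max_rounds` is exhausted.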
Architecture
Figure 4: The Proprioceptive Point-cloud Transformer (PPT) policy architecture used for robot inference.
Evaluation Highlights
  • GenSim2-generated data co-trained with real data improves the real-world success rate by 21.2 percentage points (0.575 vs 0.363) compared to training on real data alone
  • Achieves a 0.60 solution rate on generated long-horizon tasks using a reasoning LLM (o1), compared with 0.43 for the RoboGen baseline
  • Primitive task generation achieves a 0.78 solution rate, surpassing RoboGen's 0.58 on comparable sub-tasks
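The 21.2 figure in the first highlight is the absolute difference between the two success rates (0.575 and 0.363), not a relative change. A short worked calculation makes the distinction explicit. The helper names below are illustrative, and only the two success rates come from the summary.

```python
def absolute_gain_points(baseline: float, improved: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return (improved - baseline) * 100


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline rate."""
    return (improved - baseline) / baseline * 100


real_only = 0.363    # success rate when training on real data alone
co_trained = 0.575   # success rate when co-training with GenSim2 data

print(round(absolute_gain_points(real_only, co_trained), 1))   # 21.2
print(round(relative_gain_percent(real_only, co_trained), 1))  # 58.4
```

Read as a relative change, the same pair of numbers corresponds to roughly a 58% improvement over the real-data-only baseline.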
Breakthrough Assessment
8/10
Significant advance in automated robotic data generation. Successfully integrates VLM feedback for motion planning (solving a key reliability issue in generative sim) and demonstrates strong sim-to-real results.