
Large Motion Model for Unified Multi-Modal Motion Generation

Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jin Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu
European Conference on Computer Vision (2024)
MM Benchmark Pretraining

📝 Paper Summary

Human Motion Generation, Multi-Modal Generative Models
LMM unifies diverse motion generation tasks into a single generalist model by consolidating datasets into MotionVerse and employing a body-part-aware attention mechanism to handle heterogeneous motion formats.
Core Problem
Existing motion generation models are specialists tailored to single tasks (e.g., text-to-motion) with incompatible data formats, preventing the scaling of motion knowledge across domains.
Why it matters:
  • Specialist models suffer from limited data quantity and narrow domains, leading to poor generalization.
  • Disparate motion formats (SMPL vs. keypoints) and frame rates make it difficult to leverage vast amounts of available motion data for a single unified model.
  • Transferring knowledge across tasks (e.g., from music-to-dance to action-to-motion) is hindered by inconsistent problem formulations.
Concrete Example: A text-to-motion model trained on SMPL rotation data cannot utilize a music-to-dance dataset that uses sparse keypoint coordinates and different frame rates, preventing the model from learning general motion dynamics from the dance data.
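The frame-rate half of this mismatch can be illustrated with a minimal sketch that resamples a keypoint clip onto a common frame rate by linear interpolation. This is an illustrative assumption, not the paper's preprocessing pipeline: the function name `resample_motion` and the flat per-frame coordinate layout are hypothetical, and linear interpolation is only appropriate for positional keypoints (rotation-based formats such as SMPL would need rotation-aware interpolation instead).

```python
from typing import List

Frame = List[float]  # flattened keypoint coordinates for one frame (assumed layout)


def resample_motion(frames: List[Frame], src_fps: float, dst_fps: float) -> List[Frame]:
    """Linearly interpolate a keypoint clip from src_fps to dst_fps."""
    if not frames:
        return []
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")

    last = len(frames) - 1
    duration = last / src_fps                    # seconds spanned by the clip
    n_out = int(round(duration * dst_fps)) + 1   # number of output frames

    out: List[Frame] = []
    for i in range(n_out):
        # Position of output frame i on the source timeline, in source-frame units.
        pos = min(i / dst_fps * src_fps, float(last))
        lo = int(pos)
        hi = min(lo + 1, last)
        w = pos - lo                             # blend weight toward the next frame
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Usage: bring a 30 fps clip to 60 fps so it can share a timeline with 60 fps data.
clip_30fps = [[0.0], [1.0], [2.0]]
clip_60fps = resample_motion(clip_30fps, src_fps=30, dst_fps=60)
```

Resampling addresses only the temporal mismatch; aligning different skeleton definitions still requires a shared motion representation.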
Key Novelty
Large Motion Model (LMM) with MotionVerse Benchmark
  • Consolidates 16 datasets into MotionVerse, a unified benchmark with a standardized motion representation (TOMATO-based) that aligns varying skeletal structures and frame rates.
  • Introduces ArtAttention, a body-part-aware attention mechanism that decomposes the human body into 10 independent but coordinated parts to handle diverse topological requirements.
  • Uses a hybrid pre-training strategy with random frame rate augmentation and variable masking to learn robust motion patterns from heterogeneous data.
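One way to picture the random frame rate augmentation and variable masking mentioned above is the sketch below. It is a hedged illustration, not the authors' implementation: the function `augment_clip`, the candidate strides, and the masking-ratio bounds are all hypothetical choices made for the example.

```python
import random
from typing import List, Optional, Sequence, Tuple

Frame = List[float]  # one frame of motion features (assumed layout)


def augment_clip(
    frames: Sequence[Frame],
    rng: random.Random,
    strides: Sequence[int] = (1, 2, 3),                   # assumed downsampling factors
    mask_ratio_range: Tuple[float, float] = (0.1, 0.5),   # assumed masking bounds
) -> Tuple[List[Optional[Frame]], List[bool], int]:
    """Subsample a clip at a random stride, then hide a random share of frames."""
    # Random frame rate: keeping every `stride`-th frame emulates a lower capture rate.
    stride = rng.choice(list(strides))
    sub = [list(f) for f in frames[::stride]]

    # Variable masking: the fraction of hidden frames is itself drawn at random.
    ratio = rng.uniform(*mask_ratio_range)
    n_mask = min(len(sub), int(round(ratio * len(sub))))
    masked_idx = set(rng.sample(range(len(sub)), n_mask))

    is_masked = [i in masked_idx for i in range(len(sub))]
    # Masked frames are replaced by None; a model would be trained to reconstruct them.
    visible = [None if m else f for f, m in zip(sub, is_masked)]
    return visible, is_masked, stride


# Usage: a seeded generator makes the augmentation reproducible.
clip = [[float(t)] for t in range(12)]
visible, is_masked, stride = augment_clip(clip, random.Random(0))
```

Drawing both the stride and the masking ratio per sample exposes a model to many effective frame rates and occlusion levels, which is the intuition behind training on heterogeneous data.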
Evaluation Highlights
  • Consolidation of 16 distinct datasets into the MotionVerse benchmark, comprising 320k motion sequences.
  • Aggregation of 100 million total frames of motion data for large-scale pre-training.
  • Unification of 10 distinct motion tasks (including 7 standard and 3 new multi-modal tasks) under a single problem formulation.
Breakthrough Assessment
8/10
Proposes the first mega-scale unified benchmark and generalist model for motion generation, addressing the major bottleneck of data fragmentation in the field.