Large Motion Model for Unified Multi-Modal Motion Generation

📝 Paper Summary

Human Motion Generation
Multi-Modal Generative Models

LMM unifies diverse motion generation tasks into a single generalist model by consolidating datasets into MotionVerse and employing a body-part-aware attention mechanism to handle heterogeneous motion formats.

Core Problem

Existing motion generation models are specialists tailored to single tasks (e.g., text-to-motion) with incompatible data formats, preventing the scaling of motion knowledge across domains.

Why it matters:

Specialist models suffer from limited data quantity and narrow domains, leading to poor generalization.
Disparate motion formats (SMPL vs. keypoints) and frame rates make it difficult to leverage vast amounts of available motion data for a single unified model.
Transferring knowledge across tasks (e.g., music-to-dance to action-to-motion) is currently hindered by inconsistent problem formulations.

Concrete Example: A text-to-motion model trained on SMPL rotation data cannot utilize a music-to-dance dataset that uses sparse keypoint coordinates and different frame rates, preventing the model from learning general motion dynamics from the dance data.
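
The frame-rate half of this mismatch can be illustrated with a minimal resampling sketch. The summary does not describe how MotionVerse actually aligns frame rates, so the function name `resample_motion` and the choice of linear interpolation are assumptions made purely for illustration.

```python
def resample_motion(frames, src_fps, dst_fps):
    """Linearly resample a clip from src_fps to dst_fps (illustrative assumption).

    frames: list of poses, each pose a flat list of floats (e.g. keypoint coordinates).
    """
    if len(frames) < 2 or src_fps == dst_fps:
        return [list(f) for f in frames]
    duration = (len(frames) - 1) / src_fps        # seconds spanned by the clip
    n_out = int(round(duration * dst_fps)) + 1    # number of output frames
    out = []
    for i in range(n_out):
        t = min(i / dst_fps, duration) * src_fps  # position in source-frame units
        lo = min(int(t), len(frames) - 2)         # index of the earlier source frame
        w = t - lo                                # blend weight toward frame lo + 1
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[lo + 1])])
    return out


clip_30fps = [[0.0], [1.0], [2.0]]
clip_60fps = resample_motion(clip_30fps, 30, 60)
```

Linear interpolation is reasonable for keypoint coordinates; rotation data such as SMPL parameters would need a rotation-aware scheme (for example, spherical interpolation) instead.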

Key Novelty

Large Motion Model (LMM) with MotionVerse Benchmark

Consolidates 16 datasets into MotionVerse, a unified benchmark with a standardized motion representation (TOMATO-based) that aligns varying skeletal structures and frame rates.
Introduces ArtAttention, a body-part-aware attention mechanism that decomposes the human body into 10 independent but coordinated parts to handle diverse topological requirements.
Uses a hybrid pre-training strategy with random frame rate augmentation and variable masking to learn robust motion patterns from heterogeneous data.
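
The two augmentations named in the last point can be sketched as follows. This is a rough illustration, not the authors' recipe: the function name `augment`, the stride-based way of imitating lower frame rates, and the bounds `max_stride` and `max_mask_ratio` are all assumptions.

```python
import random


def augment(frames, rng, max_stride=4, max_mask_ratio=0.5):
    """Return (subsampled_frames, visibility_mask) for one training clip.

    Hypothetical sketch: parameter names and ranges are not from the paper.
    """
    stride = rng.randint(1, max_stride)        # stride k imitates a k-times lower frame rate
    sub = frames[::stride]
    ratio = rng.uniform(0.0, max_mask_ratio)   # variable fraction of frames to hide
    n_hidden = int(ratio * len(sub))
    hidden = set(rng.sample(range(len(sub)), n_hidden))
    mask = [0 if i in hidden else 1 for i in range(len(sub))]  # 1 = visible, 0 = masked
    return sub, mask


rng = random.Random(0)
sub, mask = augment(list(range(120)), rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.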

Evaluation Highlights

Consolidation of 16 distinct datasets into the MotionVerse benchmark, comprising 320k motion sequences.
Aggregation of 100 million total frames of motion data for large-scale pre-training.
Unification of 10 distinct motion tasks (including 7 standard and 3 new multi-modal tasks) under a single problem formulation.

Breakthrough Assessment

8/10

Proposes the first mega-scale unified benchmark and generalist model for motion generation, addressing the major bottleneck of data fragmentation in the field.

⚙️ Technical Details

Problem Definition

Setting: Unified Multi-Modal Motion Generation

Inputs: Multi-modal control signals c (text, speech, music, video) and optional partial motion sequence x with visibility mask m

Outputs: Target motion sequence x in a unified intermediate representation
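
One way to read this formulation is that the visibility mask m determines which task is being solved. The sketch below builds frame-level masks for three cases; the task labels, the function name `visibility_mask`, and the convention that 1 means "given as input" are assumptions for illustration rather than the paper's notation.

```python
def visibility_mask(task, n_frames, n_known=0):
    """Frame-level mask: 1 = frame given as input, 0 = frame to be generated."""
    if n_known < 0 or n_known > n_frames:
        raise ValueError("n_known must lie in [0, n_frames]")
    if task == "generation":        # e.g. text-to-motion: no motion is observed
        return [0] * n_frames
    if task == "prediction":        # the first n_known frames are observed
        return [1] * n_known + [0] * (n_frames - n_known)
    if task == "in_betweening":     # n_known frames are observed at each end
        if 2 * n_known > n_frames:
            raise ValueError("observed ends overlap")
        middle = n_frames - 2 * n_known
        return [1] * n_known + [0] * middle + [1] * n_known
    raise ValueError(f"unknown task: {task}")


prediction_mask = visibility_mask("prediction", 5, n_known=2)
inbetween_mask = visibility_mask("in_betweening", 6, n_known=1)
```

Under this reading, adding a control signal c to the prediction or in-betweening masks gives the conditional variants of those tasks.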

Pipeline Flow

Input Processing: Multi-modal Encoder (ImageBind) → Feature Embedding
Generation: Diffusion Transformer (LMM) with ArtAttention → Unified Motion Representation
Output Processing: Representation Translator → Dataset-Specific Motion Format
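
The three stages above can be sketched as a simple function composition. Everything here is a toy stand-in: `run_pipeline`, `toy_encoder`, `toy_generator`, and the format names are hypothetical, and none of them reflects the real ImageBind or LMM interfaces.

```python
from typing import Callable, Dict, List

Pose = List[float]
Motion = List[Pose]


def run_pipeline(
    condition: str,
    encoder: Callable[[str], List[float]],
    generator: Callable[[List[float], int], Motion],
    translators: Dict[str, Callable[[Motion], Motion]],
    target_format: str,
    n_frames: int,
) -> Motion:
    features = encoder(condition)               # stage 1: condition -> feature embedding
    unified = generator(features, n_frames)     # stage 2: features -> unified motion
    return translators[target_format](unified)  # stage 3: unified -> dataset format


# Toy stand-ins (hypothetical; not ImageBind and not the real LMM).
def toy_encoder(text: str) -> List[float]:
    return [float(len(text))]


def toy_generator(features: List[float], n_frames: int) -> Motion:
    return [[features[0] + t, 0.0, 0.0, 0.0] for t in range(n_frames)]


toy_translators = {
    "unified": lambda motion: motion,
    "keypoints_2d": lambda motion: [pose[:2] for pose in motion],
}

result = run_pipeline("walk", toy_encoder, toy_generator, toy_translators, "keypoints_2d", 3)
```

Keeping the translators in a dictionary keyed by format mirrors the idea that the generator is shared while only the last stage is dataset-specific.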

System Modules

Multi-modal Encoder

Encodes text, speech, music, or video inputs into a unified feature space

Model or implementation: ImageBind

Large Motion Model (LMM)

Generates motion sequences in the unified format based on conditioning

Model or implementation: Transformer-based Diffusion Model with ArtAttention

Representation Translator

Converts the model's unified output back to the specific format of the target dataset (e.g., SMPL rotations or keypoints)

Model or implementation: Learnable Mapping Network

Novel Architectural Elements

ArtAttention (Articulated Attention) mechanism: splits the attention computation into 10 independent body-part blocks (head, spine, arms, legs, hands, etc.) to allow partial control and handling of missing joints
Two-stage output pipeline: Generates into a unified 'MotionVerse' format first, then translates to specific dataset formats via learned translators
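
The idea of per-part attention blocks can be illustrated with a heavily simplified sketch. It assumes each body part owns a contiguous slice of the per-frame feature vector and runs plain self-attention over time within that slice. It omits learned projections, conditioning, and whatever cross-part coordination the real ArtAttention uses; the names `part_attention` and `art_attention_sketch` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def part_attention(seq):
    """Self-attention over time for one body part; seq has shape [T][d].

    Queries, keys, and values are the raw features (no learned projections).
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def art_attention_sketch(frames, part_slices):
    """frames: [T][D]; part_slices: {part_name: (start, stop)} feature ranges."""
    out = [[0.0] * len(frames[0]) for _ in frames]
    for start, stop in part_slices.values():
        attended = part_attention([f[start:stop] for f in frames])
        for t, vec in enumerate(attended):
            out[t][start:stop] = vec       # each part writes only its own slice
    return out


parts = {"arm": (0, 2), "leg": (2, 3)}     # two toy parts instead of ten
out_a = art_attention_sketch([[1.0, 0.0, 2.0], [0.0, 1.0, 4.0]], parts)
out_b = art_attention_sketch([[1.0, 0.0, 9.0], [0.0, 1.0, -3.0]], parts)
```

Because each block only reads its own slice, changing the leg features between the two calls leaves the arm outputs untouched, which is the property that would allow partial control or missing joints to be handled per part.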

Modeling

Base Model: Transformer-based Diffusion Model

Training Method: Hybrid Pre-training (Unsupervised + Supervised)

Training Data:

MotionVerse benchmark: 16 datasets, 320k sequences, 100 million frames
Unified representation uses 52 joints (22 body + 30 hand) divided into 10 parts

Compute: Not reported in the paper

Comparison to Prior Work

vs. Specialists (MotionDiffuse, Bailando): LMM is a generalist model handling 10 tasks, whereas each specialist handles only one.
vs. MDM: LMM uses ArtAttention to explicitly model body parts and handles multi-modal inputs via MotionVerse, rather than just text/action classes.
vs. OmniControl [not cited in paper]: OmniControl focuses on spatial control of T2M, while LMM focuses on cross-task and cross-modality generalization.

Limitations

Requires training specific representation translators for each target dataset format to enable evaluation.
Aligning 16 heterogeneous datasets into a single format (MotionVerse) is complex.
No quantitative performance metrics (e.g., FID, R-Precision) appear in the summarized text, so the 'competitive performance' claim cannot be verified from it.

Reproducibility

The paper states that code and models will be released, but the summarized text gives no URL. Detailed hyperparameters (learning rate, batch size) and specific performance metrics are also absent from it.

📊 Experiments & Results

Evaluation Setup

Evaluation on the MotionVerse benchmark covering 10 tasks including Text-to-Motion, Music-to-Dance, and Motion Prediction.

Benchmarks:

MotionVerse (Unified Multi-Task Motion Generation) [New]

Metrics:

Not reported in the summarized text
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The paper successfully consolidates the fragmented motion generation landscape into MotionVerse, a single benchmark with 320k sequences and 100M frames.
The proposed ArtAttention mechanism allows a single model to handle diverse body topologies and missing joint data inherent in multi-source datasets.
The unified problem formulation enables the definition of 3 new tasks: conditional motion prediction, conditional motion in-betweening, and multi-condition motion generation.
Note: Specific numeric performance results (e.g., FID scores) were not present in the provided text snippet, though the abstract claims competitive performance against state-of-the-art specialists.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models
Transformer Architecture
Human Motion Representations (SMPL, Keypoints)

Key Terms

MotionVerse: A large-scale, multi-modal, multi-task motion generation benchmark constructed by the authors, unifying 16 datasets

LMM: Large Motion Model, the authors' proposed transformer-based diffusion model for generalist motion generation

ArtAttention: Articulated Attention, a novel attention mechanism that models 10 body parts independently to handle missing joints and diverse topologies

TOMATO: A unified motion representation format used as an intermediary to align diverse dataset formats (based on prior work)

ImageBind: A pre-trained multi-modal encoder used to embed diverse control signals (audio, text, video) into a shared feature space

SMPL: Skinned Multi-Person Linear model, a realistic 3D human body model often used for motion annotations