
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black
The University of Tokyo, Keio University, Max Planck Institute for Intelligent Systems, Tsinghua University
Computer Vision and Pattern Recognition (2023)
MM Speech Benchmark

📝 Paper Summary

Co-Speech Gesture Generation, 3D Human Motion Synthesis, Cross-Modal Representation Learning
EMAGE generates synchronized full-body and facial gestures from audio. It uses masked gesture modeling to encode body hints and is trained on a unified mesh-level dataset (BEAT2).
Core Problem
Existing co-speech gesture generation lacks a unified, high-quality mesh-level dataset, which hampers training with vertex losses. Existing models also fail to coordinate the body parts of a holistic avatar (face, hands, body) when generating from audio.
Why it matters:
  • Current datasets use incompatible formats (skeleton vs. blendshapes), preventing unified training for full-body digital humans
  • Models often regress to a 'mean pose' or lack diversity because they do not model body parts separately, even though each part correlates with audio differently (e.g., the face follows speech far more closely than the lower body)
  • Partial gesture completion (infilling specific frames while respecting audio) is difficult for autoregressive models
Concrete Example: A digital avatar might need to wave while speaking, but standard audio-to-gesture models ignore the 'wave' constraint and generate generic arm movements. EMAGE can take the specific 'wave' frames as a masked input and generate the rest of the motion synchronized to speech.
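The completion setup in this example can be sketched as an input-construction step: the known 'seed' frames are kept, and every other frame is replaced by a placeholder that the model is asked to fill in. This is a minimal illustration in plain Python, not EMAGE's actual code. The names `MASK` and `build_masked_input`, and the toy pose vectors, are hypothetical.

```python
from typing import List, Optional, Sequence

Pose = List[float]   # one frame of pose parameters (toy representation)
MASK = None          # hypothetical placeholder for a frame the model must generate


def build_masked_input(
    gestures: Sequence[Pose], seed_frames: Sequence[int]
) -> List[Optional[Pose]]:
    """Keep the frames listed in seed_frames; replace all others with MASK."""
    keep = set(seed_frames)
    for idx in keep:
        if not 0 <= idx < len(gestures):
            raise IndexError(f"seed frame {idx} is outside the clip")
    return [list(pose) if i in keep else MASK for i, pose in enumerate(gestures)]


# Toy 6-frame clip with 2 pose values per frame; frames 2 and 3 hold the "wave".
clip = [[0.0, 0.0], [0.1, 0.0], [0.9, 0.5], [0.8, 0.6], [0.1, 0.1], [0.0, 0.0]]
masked = build_masked_input(clip, seed_frames=[2, 3])

# Indices the model would have to generate, conditioned on audio and the seeds.
to_generate = [i for i, frame in enumerate(masked) if frame is MASK]
```

In a real model the placeholder would typically be a learned mask embedding rather than `None`, and the audio features for all frames would be supplied alongside the masked sequence.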
Key Novelty
Masked Audio-Conditioned Gesture Modeling with Compositional Quantization
  • Introduces BEAT2, a standardized dataset converting diverse mocap data into unified SMPL-X body and FLAME head parameters for mesh-level training
  • Uses a masked transformer to learn bidirectional dependencies between audio and gestures, allowing the model to fill in missing motion based on sparse 'seed' gestures
  • Decodes motion using four separate VQ-VAEs (face, upper body, hands, lower body) to capture each body part's distinct dynamics and correlation with audio
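The idea of quantizing each body part against its own codebook can be illustrated with a nearest-neighbor lookup, which is the quantization step of a generic VQ-VAE. The sketch below is an assumption-laden toy, not the paper's implementation: the function names, the two-entry codebooks, and the latent values are all made up for illustration, and the encoders and decoders around the lookup are omitted.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def nearest_code(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to latent (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))


def quantize_parts(
    latents: Dict[str, Vector], codebooks: Dict[str, Sequence[Vector]]
) -> Dict[str, int]:
    """Quantize each body part's latent against that part's own codebook."""
    return {part: nearest_code(vec, codebooks[part]) for part, vec in latents.items()}


# Toy codebooks: one per body part, two entries each (values are made up).
codebooks = {
    "face": [[0.0, 0.0], [1.0, 1.0]],
    "upper_body": [[0.0, 1.0], [1.0, 0.0]],
    "hands": [[0.5, 0.5], [-0.5, -0.5]],
    "lower_body": [[0.0, 0.0], [0.2, 0.2]],
}
# Toy encoder outputs for a single time step (values are made up).
latents = {
    "face": [0.9, 0.8],
    "upper_body": [0.1, 0.9],
    "hands": [-0.4, -0.6],
    "lower_body": [0.05, 0.0],
}
codes = quantize_parts(latents, codebooks)
```

Because the codebooks are independent, a code index for the face says nothing about the hands, which is what lets each part keep its own motion vocabulary.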
Evaluation Highlights
  • Achieves the lowest FGD (Fréchet Gesture Distance, lower is better) of 4.88 on BEAT2, outperforming CaMN (8.66) and TalkSHOW (7.87)
  • Outperforms baselines on audio-gesture synchrony (BeatAlign, higher is better), scoring 0.81 compared to TalkSHOW's 0.76
  • Generates stable motion with significantly lower foot sliding (1.23 cm) compared to baselines like Habibie et al. (2.42 cm)
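For readers unfamiliar with FGD: it is a Fréchet distance between Gaussians fitted to features of real and generated gestures, analogous to FID for images. The sketch below shows a simplified variant that assumes diagonal covariances, so the trace term reduces to a sum of squared differences of standard deviations. The full metric uses full covariance matrices and is typically computed on features from a pretrained gesture autoencoder; the function name and toy data here are hypothetical, and this is not the evaluation code behind the numbers above.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def diag_frechet_distance(
    real: Sequence[Sequence[float]], generated: Sequence[Sequence[float]]
) -> float:
    """Squared Frechet distance between Gaussians fitted per feature dimension.

    Simplification (assumption): covariances are treated as diagonal, so
    Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) becomes sum_i (sd_r[i] - sd_g[i]) ** 2.
    """
    def stats(samples: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
        columns = list(zip(*samples))  # one tuple of values per feature dimension
        means = [fmean(c) for c in columns]
        stds = [math.sqrt(pvariance(c)) for c in columns]
        return means, stds

    mu_r, sd_r = stats(real)
    mu_g, sd_g = stats(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_g))
    return mean_term + cov_term


# Toy feature sets: two samples, two feature dimensions each.
real = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[1.0, 0.0], [3.0, 2.0]]   # same spread, first dimension shifted by 1

same_score = diag_frechet_distance(real, real)
shifted_score = diag_frechet_distance(real, shifted)
```

Identical feature sets give a distance of zero, and the score grows as either the means or the spreads of the two sets drift apart, which is why a lower FGD indicates generated motion that is statistically closer to real motion.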
Breakthrough Assessment
8/10
Significant contribution via the BEAT2 dataset standardization, which resolves a major hurdle in the field. The masked modeling approach effectively unifies generation and completion tasks.