Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

📝 Paper Summary

Audio-driven Co-speech Gesture Generation
Conditional Diffusion Models

DiffGesture generates high-fidelity, audio-aligned co-speech gestures using a diffusion model conditioned on audio and initial poses, featuring a specialized stabilizer to ensure smooth temporal motion.

Core Problem

Existing GAN-based methods for co-speech gesture generation suffer from mode collapse and unstable training, failing to capture complex audio-gesture distributions.

Why it matters:

Animating virtual avatars with realistic co-speech gestures is crucial for human-machine interaction and embodied AI
Current GAN approaches often produce dull, repetitive, or implausible poses because they struggle to learn the full audio-gesture distribution
Naive application of diffusion models to sequential data introduces temporal jitter (inconsistency) because standard denoising adds independent noise per frame

Concrete Example: When generating gestures for a long speech clip, a standard diffusion model might produce diverse poses for each individual frame that, when played together, look jittery or incoherent because the noise sampling ignores temporal continuity. GANs often collapse to a 'mean pose' that looks safe but lacks expressive variation.
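The jitter described in this example can be reproduced on a toy signal. The sketch below is an illustration only, not the paper's setup: the 1-D sine "joint trajectory", the clip length of 34 frames, and the noise scale are all assumptions chosen for the demo. It compares independent per-frame noise with a single noise draw shared by every frame.

```python
import math
import random

def mean_frame_delta(seq):
    """Average absolute change between consecutive frames (a crude jitter measure)."""
    return sum(abs(b - a) for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

random.seed(0)
n_frames = 34  # hypothetical clip length, used only for this illustration
# A smooth 1-D trajectory standing in for one joint coordinate over time.
smooth = [math.sin(2 * math.pi * i / n_frames) for i in range(n_frames)]

sigma = 0.3  # hypothetical noise scale
# Independent noise per frame: every frame is perturbed separately.
independent = [x + random.gauss(0.0, sigma) for x in smooth]
# One noise draw shared by all frames: the whole clip shifts together.
offset = random.gauss(0.0, sigma)
shared = [x + offset for x in smooth]

jitter_clean = mean_frame_delta(smooth)
jitter_independent = mean_frame_delta(independent)
jitter_shared = mean_frame_delta(shared)
print(jitter_clean, jitter_independent, jitter_shared)
```

Shared noise leaves frame-to-frame differences unchanged, while independent noise inflates them, which is the visual jitter the summary refers to.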

Key Novelty

Diffusion Co-Speech Gesture (DiffGesture)

Models gesture generation as a conditional diffusion process on entire skeleton sequences, treating gesture clips as the latent space
Uses a Diffusion Audio-Gesture Transformer to attend to multi-modal context (audio, initial poses) across long temporal dependencies
Introduces a Diffusion Gesture Stabilizer that anneals noise variance over time during sampling to eliminate temporal jitter without retraining

Architecture

Overview of the DiffGesture framework including the forward diffusion process, the reverse conditional denoising transformer, and the stabilizer sampling.

Evaluation Highlights

Achieves state-of-the-art Fréchet Gesture Distance (FGD) of 1.506 on TED Gesture, significantly outperforming the previous best (HA2G: 3.072)
Improves FGD on TED Expressive to 2.600 compared to HA2G's 5.306, demonstrating better handling of complex finger movements
Enhances diversity on TED Expressive (182.757 vs HA2G's 173.899) while maintaining high beat consistency (BC)

Breakthrough Assessment

8/10

First successful adaptation of diffusion models to the specific challenges of co-speech gesture generation, solving the critical temporal consistency issue inherent to frame-wise noise sampling.

⚙️ Technical Details

Problem Definition

Setting: Generate a sequence of human skeletons conditioned on speech audio and initial poses

Inputs: Speech audio sequence a and initial poses p (first M frames)

Outputs: Human skeleton sequence x (N frames) aligned with audio

Pipeline Flow

Audio Encoder (extracts features)
Diffusion Audio-Gesture Transformer (denoises gesture sequence conditioned on audio)
Diffusion Gesture Stabilizer (sampling strategy)

System Modules

Audio Encoder

Extract features from raw audio clips

Model or implementation: 1D-CNN (same as Trimodal)

Diffusion Audio-Gesture Transformer

Predict the noise added to the gesture sequence at timestep t, attending to audio and initial pose context

Model or implementation: Transformer Decoder (8 layers, 256/512 hidden dim)
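As a rough sketch of how per-frame tokens for such a transformer could be assembled, the snippet below concatenates each noisy skeleton frame with its context features. The function name `build_tokens`, the toy dimensions, and the choice to concatenate along the feature axis are assumptions made for illustration; the summary does not specify the exact layout or how the timestep is injected.

```python
def build_tokens(noisy_frames, audio_features, pose_context):
    """Concatenate, per frame, the noisy skeleton vector with its context features."""
    if not (len(noisy_frames) == len(audio_features) == len(pose_context)):
        raise ValueError("all inputs must cover the same number of frames")
    return [
        list(x) + list(a) + list(p)
        for x, a, p in zip(noisy_frames, audio_features, pose_context)
    ]

# Toy sizes, chosen only for illustration.
n_frames, pose_dim, audio_dim = 4, 6, 3
noisy = [[0.1 * f] * pose_dim for f in range(n_frames)]   # noisy gesture x_t
audio = [[1.0] * audio_dim for _ in range(n_frames)]      # per-frame audio features
context = [[0.0] * pose_dim for _ in range(n_frames)]     # initial-pose context
tokens = build_tokens(noisy, audio, context)
print(len(tokens), len(tokens[0]))
```

Because every frame becomes one token, self-attention can relate any frame to any other in a single non-autoregressive pass.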

Diffusion Gesture Stabilizer

Modify the noise sampling process during inference to ensure temporal smoothness

Model or implementation: Algorithm (Thresholding or Smooth Sampling)
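One way to picture the two variants named here is as a multiplier on the standard deviation of the noise injected at each reverse step. The sketch below is a loose illustration under stated assumptions, not the paper's algorithm: the function names, the threshold of 100 steps, the reduced scale `low=0.0`, and the linear cooling rule are all hypothetical.

```python
import random

def noise_scale(t, total_steps, mode="smooth", threshold=100, low=0.0):
    """Multiplier on the sampling-noise std at reverse step t (t counts down)."""
    if mode == "threshold":
        # Full noise early in sampling, reduced noise for the final steps.
        return 1.0 if t > threshold else low
    if mode == "smooth":
        # Linearly cool the noise from 1 (t = total_steps) toward 0 (t = 0).
        return t / total_steps
    raise ValueError(f"unknown mode: {mode}")

def sample_step_noise(n_frames, dim, sigma_t, scale, rng):
    """Per-frame Gaussian noise for one reverse step, with annealed std."""
    return [[rng.gauss(0.0, sigma_t * scale) for _ in range(dim)]
            for _ in range(n_frames)]

total_steps = 500  # timestep count listed under Key Hyperparameters
rng = random.Random(0)
early = sample_step_noise(4, 3, sigma_t=0.1,
                          scale=noise_scale(400, total_steps, "threshold"), rng=rng)
late = sample_step_noise(4, 3, sigma_t=0.1,
                         scale=noise_scale(10, total_steps, "threshold"), rng=rng)
```

Since the rule only changes how noise is drawn during sampling, it needs no retraining of the denoiser, consistent with the summary's description of an inference-time module.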

Novel Architectural Elements

Concatenation of skeleton frames and context features as individual tokens in a non-autoregressive Transformer
Diffusion Gesture Stabilizer module applied purely at inference time to enforce temporal coherence via annealed noise

Modeling

Base Model: Transformer (8 blocks, 4 heads)

Training Method: Conditional DDPM training with implicit classifier-free guidance

Objective Functions:

Purpose: Minimize reconstruction error of the noise.

Formally: MSE between true noise epsilon and predicted noise epsilon_theta(x_t, c, t).
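Written out, the objective above takes the usual DDPM noise-prediction form. The expression below uses the standard DDPM notation (x_0 for the clean gesture clip, c for the audio and initial-pose context, and the cumulative product \bar{\alpha}_t); these symbols are the conventional ones and are assumed here rather than quoted from the paper.

```latex
\mathcal{L}
  = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, c, t) \right\rVert_2^2 \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```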

Training Data:

TED Gesture (1,766 videos, 10 upper body joints)
TED Expressive (higher fidelity, 43 joints including fingers)

Key Hyperparameters:

timesteps: 500
beta_schedule: linear from 1e-4 to 0.02
learning_rate: 5e-4
batch_size: Not reported in the paper
transformer_hidden_dim: 256 (TED Gesture), 512 (TED Expressive)
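The listed timestep count and linear beta range determine the forward noising process. A minimal sketch in plain Python follows; the helper names `linear_beta_schedule`, `cumulative_alphas`, and `q_sample` are illustrative, and the closed-form noising step is the standard DDPM one rather than code taken from the paper's repository.

```python
import math

def linear_beta_schedule(timesteps=500, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s <= t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward noising: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
print(alpha_bars[0], alpha_bars[-1])  # signal retained at the first and last step
```

During training, a timestep is drawn at random, `q_sample` produces the noisy clip, and the network is asked to recover `eps`.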

Compute: 10 hours (TED Gesture) / 20 hours (TED Expressive) on a single NVIDIA RTX 3090

Comparison to Prior Work

vs. HA2G: DiffGesture uses diffusion instead of GANs, avoiding mode collapse and achieving better distribution coverage (lower FGD)
vs. Trimodal: DiffGesture uses a non-autoregressive transformer on full sequences rather than autoregressive RNN/Transformer, preventing error accumulation
vs. Listen-Denoise-Action [not cited in paper]: LDA uses diffusion for general motion; DiffGesture specifically tackles co-speech audio alignment and temporal jitter via the Stabilizer

Limitations

Inference speed is slower than GAN-based methods due to iterative diffusion sampling (500 steps)
Standard DDPM sampling does not address temporal inconsistency, so smooth motion depends on the additional inference-time Stabilizer
Performance depends on the quality of extracted pose annotations (OpenPose/ExPose)

Reproducibility

Code: https://github.com/Advocate99/DiffGesture

Code publicly available. Uses standard datasets (TED Gesture, TED Expressive). Audio encoder and pose representation follow prior work (Trimodal, HA2G).

📊 Experiments & Results

Evaluation Setup

Generate gestures from held-out audio clips; compare against ground truth and baselines

Benchmarks:

TED Gesture: upper-body gesture generation (10 joints)
TED Expressive: full upper-body and finger gesture generation (43 joints)

Metrics:

Fréchet Gesture Distance (FGD)
Beat Consistency Score (BC)
Diversity (Mean feature distance)
Statistical methodology: Not explicitly reported in the paper
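For reference, a Fréchet-style distance such as FGD is usually computed by fitting Gaussians to feature vectors of real and generated gestures and comparing their means and covariances. The formula below is the standard Fréchet distance between two Gaussians; the subscripts r (real) and g (generated) are notation assumed here, and the feature extractor used to obtain the statistics is not described in this summary.

```latex
\mathrm{FGD}
  = \left\lVert \mu_r - \mu_g \right\rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The distance is zero when the two feature distributions share the same mean and covariance, which is why lower values indicate closer agreement with real gestures.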

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
State-of-the-art comparison on TED Gesture dataset showing significant improvements in realism (FGD) and diversity.
TED Gesture	FGD	3.072	1.506	-1.566
TED Gesture	Diversity	104.322	106.722	+2.400
Results on TED Expressive dataset, which involves complex finger movements.
TED Expressive	FGD	5.306	2.600	-2.706
TED Expressive	Diversity	173.899	182.757	+8.858
TED Expressive	BC	0.641	0.718	+0.077
Ablation studies validating the necessity of the Stabilizer and Classifier-free guidance.
TED Expressive	FGD	2.792	2.600	-0.192
TED Expressive	FGD	3.326	2.600	-0.726
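The Δ column for the main comparison rows can be re-derived from the Baseline and This Paper columns. The short check below recomputes the absolute differences and adds a relative change for context; the variable names are illustrative, and the ablation rows are left out because the table does not say which variant each one corresponds to.

```python
# (benchmark, metric, baseline, this paper), copied from the table above
rows = [
    ("TED Gesture", "FGD", 3.072, 1.506),
    ("TED Gesture", "Diversity", 104.322, 106.722),
    ("TED Expressive", "FGD", 5.306, 2.600),
    ("TED Expressive", "Diversity", 173.899, 182.757),
    ("TED Expressive", "BC", 0.641, 0.718),
]

# Absolute difference, rounded to the table's three decimal places.
deltas = [round(ours - base, 3) for _, _, base, ours in rows]
# Relative change with respect to the baseline, in percent.
relative = [100.0 * (ours - base) / base for _, _, base, ours in rows]

for (bench, metric, _, _), d, r in zip(rows, deltas, relative):
    print(f"{bench:15s} {metric:10s} delta={d:+.3f} ({r:+.1f}%)")
```

Read this way, FGD is roughly halved on both datasets, while the gains in Diversity and BC are comparatively modest.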

Experiment Figures

Qualitative visualization of generated gestures compared to baselines.

Main Takeaways

DiffGesture significantly outperforms GAN-based baselines (HA2G, Trimodal) in FGD, indicating much higher realism and distribution coverage.
The Diffusion Gesture Stabilizer is critical: removing it degrades performance, confirming that standard diffusion sampling causes temporal inconsistency in motion.
The Transformer architecture outperforms GRU-based diffusion backbones (Table 4), validating the choice of non-autoregressive sequence modeling.
Implicit classifier-free guidance allows trading off diversity for quality, yielding better beat consistency and fidelity than unguided models.
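The quality/diversity trade-off in the last takeaway comes from how guided sampling combines two noise predictions. The sketch below shows one common parameterization of classifier-free guidance; the function name `guided_noise` and the `scale` parameter are illustrative, and the paper's "implicit" variant may combine the predictions differently.

```python
def guided_noise(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further toward the condition (less diverse samples).
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_c = [1.0, 2.0]   # toy prediction with audio and pose context
eps_u = [0.0, 0.0]   # toy prediction with the context dropped
print(guided_noise(eps_c, eps_u, 0.0))
print(guided_noise(eps_c, eps_u, 1.0))
print(guided_noise(eps_c, eps_u, 2.0))
```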

📚 Prerequisite Knowledge

Prerequisites

Denoising Diffusion Probabilistic Models (DDPMs)
Transformer architecture (Self-Attention)
Classifier-free guidance

Key Terms

FGD: Fréchet Gesture Distance—a metric measuring the distance between the distribution of generated gestures and real gestures; lower is better

BC: Beat Consistency—measures how well the generated gesture beats align with the audio beats; higher is better

DDPM: Denoising Diffusion Probabilistic Models—generative models that learn to reverse a gradual noise-addition process to generate data

classifier-free guidance: A technique to control conditional generation by jointly training a conditional and unconditional model and combining their outputs during sampling, typically by extrapolating from the unconditional prediction toward the conditional one

annealed noise sampling: A strategy where the variance of the sampled noise is gradually reduced (cooled) over timesteps to improve stability

skeleton sequence: A time-series of vectors representing the 2D or 3D coordinates/angles of human body joints

temporal inconsistency: Jittery or jagged motion in generated video/animation caused by independent errors in consecutive frames