MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation

📝 Paper Summary

Sign Language Generation (SLG) | Masked Diffusion Models | Cross-modal Pretraining

MaDiS adapts masked diffusion language models for sign language generation by introducing tri-level pretraining and a temporal-checkpoint unmasking strategy to enable bidirectional context modeling and faster inference.

Core Problem

Existing autoregressive language models (ARLMs) for sign language generation are limited by unidirectional (left-to-right) context modeling and slow token-by-token serial inference.

Why it matters:

Left-to-right generation fails to capture future context crucial for sign language grammar and motion planning
Serial decoding creates an inference bottleneck, hindering real-time applications for Deaf and Hard-of-Hearing communities
Current methods lack grounded pretraining, missing the 3D physical nature of sign motions

Concrete Example: Generating a 100-token sign sequence with an ARLM requires 100 sequential steps. MaDiS can generate the same sequence in ~25 steps by sampling multiple tokens in parallel, while correcting early errors using bidirectional context.

Key Novelty

Masked Diffusion Language Model (MDLM) for Sign Language

Replaces autoregressive decoding with a bidirectional masked diffusion process, allowing the model to predict any token based on any context (past or future)
Introduces 'unmasking with temporal checkpoints' (UTC) to prune the vast search space of diffusion steps, enforcing coarse-to-fine generation
Tri-level pretraining forces the model to learn signs not just as tokens, but also as latent codebook features and physical 3D motions simultaneously
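
One way to picture the temporal checkpoints is as a fixed budget of masked tokens at evenly spaced diffusion times. The sketch below assumes k evenly spaced checkpoints and a linear masking ratio (so that 75% of tokens are masked at t=0.75); the function names and the linear schedule are illustrative assumptions, not the authors' implementation.

```python
def checkpoint_mask_counts(seq_len, k):
    """Masked-token budget at each checkpoint t = j/k, from t=1 (all masked) down to t=0."""
    # Integer arithmetic keeps the budgets exact.
    return [(j / k, seq_len * j // k) for j in range(k, -1, -1)]


def tokens_revealed_per_segment(seq_len, k):
    """Number of tokens that must be unmasked between consecutive checkpoints."""
    counts = [n for _, n in checkpoint_mask_counts(seq_len, k)]
    return [a - b for a, b in zip(counts, counts[1:])]


if __name__ == "__main__":
    for t, n in checkpoint_mask_counts(100, 4):
        print(f"t={t:.2f}: {n} tokens still masked")
```

Under these assumptions, any unmasking order is allowed inside a segment, but the count at each checkpoint is pinned, which is what shrinks the space of admissible unmasking trajectories.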

Architecture

Overview of MaDiS pipeline including Tri-Level Pretraining and Fine-tuning stages.

Evaluation Highlights

Achieves state-of-the-art DTW-JPE error of 6.22 on Phoenix-2014T, outperforming the previous best (SOKE) by 0.54 points
Reduces inference latency by ~30% compared to autoregressive baselines (e.g., 6.36s vs 9.20s on CSL-Daily)
Improves text-to-sign retrieval (R@1) by over 3.0 points on CSL-Daily using the new SiCLIP metric
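
For readers unfamiliar with R@1, the following minimal sketch shows how a text-to-sign Recall@1 score can be computed from paired embeddings. The embeddings here are plain lists standing in for the outputs of a joint text/motion encoder; the SiCLIP encoder itself is not reproduced, and cosine similarity is an assumed choice of scoring function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(text_embs, motion_embs):
    """Fraction of texts whose most similar motion is the paired one (same index)."""
    hits = 0
    for i, text in enumerate(text_embs):
        sims = [cosine(text, motion) for motion in motion_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(text_embs)
```

A score of 1.0 means every text retrieves its own generated motion first; the reported metric is usually this fraction expressed in percentage points.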

Breakthrough Assessment

8/10

First successful application of MDLMs to sign language generation. The method significantly improves both quality and speed, addressing the core bottleneck of autoregressive approaches.

⚙️ Technical Details

Problem Definition

Setting: Text-to-Sign Motion Generation

Inputs: Natural language text sequence

Outputs: Sequence of 3D sign motions (SMPL-X parameters)

Pipeline Flow

Sign Tokenizer (VQ-VAE encodes motion → discrete tokens)
Tri-Level Pretraining (Token + Latent + Physical objectives)
Supervised Fine-Tuning (Text conditioning + UTC unmasking)
Inference (Parallel iterative decoding)
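
The inference stage can be sketched as a confidence-based iterative unmasking loop. This is a toy illustration only: `toy_predict` is a hypothetical stand-in for the MDLM, the `MASK` id and the even per-step budget are assumptions, and the text conditioning is omitted.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token


def toy_predict(tokens, rng):
    """Stand-in for the MDLM: a (token, confidence) proposal for every masked position."""
    return {i: (rng.randrange(512), rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}


def parallel_decode(length=100, steps=25, seed=0):
    """Start fully masked and commit several tokens per step until none remain masked."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps_used = 0
    for step in range(steps):
        proposals = toy_predict(tokens, rng)
        if not proposals:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        n_commit = -(-len(proposals) // (steps - step))  # ceiling division
        # Commit the most confident proposals; the rest stay masked for later steps.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
        steps_used += 1
    return tokens, steps_used
```

With the default arguments the loop commits four tokens per step, so a 100-token sequence is finished in 25 model calls instead of 100 sequential ones.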

System Modules

Sign Tokenizer

Converts continuous 3D motion into discrete token sequences for the language model

Model or implementation: Decoupled VQ-VAE (frozen during MDLM training)
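
The tokenization step can be illustrated with a nearest-neighbor codebook lookup, which is the core operation of a generic VQ-VAE. The sketch below assumes the encoder has already produced latent vectors; the codebook values are made up, and the decoupled per-part design of the actual tokenizer is not modeled.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda idx: sq_dist(codebook[idx]))


def tokenize(latents, codebook):
    """Map a sequence of latent vectors to discrete token ids."""
    return [nearest_code(z, codebook) for z in latents]


def detokenize(tokens, codebook):
    """Map token ids back to their codebook vectors (the decoder input)."""
    return [codebook[t] for t in tokens]
```

Because every latent is snapped to a codebook entry, reconstruction quality is bounded by how well the codebook covers the motion space.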

MDLM Backbone (Generation)

Predicts masked sign tokens conditioned on text and unmasked sign tokens

Model or implementation: Qwen3-0.6B-Base (with non-causal attention mask)
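
The difference between the usual causal mask and the non-causal mask mentioned above can be shown with two small boolean matrices, where entry [i][j] says whether position i may attend to position j. This is a generic illustration and does not reflect how the mask is configured in any particular library.

```python
def causal_mask(n):
    """Autoregressive mask: position i sees only positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Non-causal mask: every position sees every other position."""
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    for name, mask in (("causal", causal_mask(4)), ("bidirectional", bidirectional_mask(4))):
        print(name)
        for row in mask:
            print(" ".join("1" if allowed else "0" for allowed in row))
```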

MoP Embedding Layer (Generation)

Fuses embeddings from different body parts (hands, body) using learnable gating weights

Model or implementation: MLP + Softmax gating

Novel Architectural Elements

Tri-level pretraining head architecture: Simultaneously predicts tokens (cross-entropy), latent codebook features (smoothed L1), and physical 3D motions (L1 via frozen VAE decoder)
Mixture-of-Parts (MoP) embedding layer: Dynamically weights contribution of different body part tokens (hands vs body) using a learned gating mechanism based on VQ-VAE codebooks
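
The gating idea behind the MoP layer can be sketched as a softmax-weighted sum of per-part embeddings. In the sketch below the gate logits are passed in directly; in the described design they would come from an MLP, which is omitted here. Function names and the example values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mop_fuse(part_embeddings, gate_logits):
    """Fuse per-part embeddings (e.g. body, left hand, right hand) with softmax gates."""
    weights = softmax(gate_logits)
    dim = len(part_embeddings[0])
    fused = [sum(w * emb[d] for w, emb in zip(weights, part_embeddings))
             for d in range(dim)]
    return fused, weights
```

Because the weights sum to one, the fused embedding stays on the same scale as the individual part embeddings, and the learned weights can be inspected to see which part dominates.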

Modeling

Base Model: Qwen3-0.6B-Base

Training Method: Masked Diffusion with Multi-Task Pretraining and Fine-tuning
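
As background, the forward (corruption) process commonly used in masked diffusion replaces each token with a mask token independently with probability t, and the model is trained to recover the originals. The sketch below illustrates that generic process; the `MASK` id is hypothetical and the paper's exact noise schedule is not specified here.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token


def mask_tokens(tokens, t, rng):
    """Replace each token with MASK independently with probability t (0 <= t <= 1)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_positions(tokens):
    """Positions the model would be asked to predict."""
    return [i for i, tok in enumerate(tokens) if tok == MASK]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = list(range(10, 20))
    noisy = mask_tokens(clean, 0.5, rng)
    print(noisy, masked_positions(noisy))
```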

Objective Functions:

Purpose: Token-level prediction accuracy.
Formally: Cross-entropy loss on masked tokens

Purpose: Latent-space feature alignment.
Formally: Smoothed L1 loss between predicted embeddings and VQ-VAE codebook vectors

Purpose: Physical-space motion grounding.
Formally: L1 loss between reconstructed motions (via VAE decoder) and ground truth motions
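
The three objectives can be combined as in the sketch below, shown for a single masked position. The weighted sum with weights `w_latent` and `w_phys` is an assumption, since the summary does not state how the terms are balanced, and the `beta` threshold of the smoothed L1 loss is likewise illustrative.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def smooth_l1(pred, target, beta=1.0):
    """Smoothed L1 loss: quadratic below beta, linear above, averaged over dimensions."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def l1(pred, target):
    """Mean absolute error over dimensions."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def tri_level_loss(logits, token, latent_pred, code_vec, motion_pred, motion_gt,
                   w_latent=1.0, w_phys=1.0):
    """Token + latent + physical terms; the weighting scheme is an assumption."""
    return (cross_entropy(logits, token)
            + w_latent * smooth_l1(latent_pred, code_vec)
            + w_phys * l1(motion_pred, motion_gt))
```

In a real training loop `motion_pred` would be obtained by passing the predicted latent through the frozen VAE decoder, so gradients flow through the decoder without updating it.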

Training Data:

Pretraining: Combined set of CSL-Daily (20K), Phoenix-2014T (8K), How2Sign (35K)
Fine-tuning: Individual datasets

Key Hyperparameters:

learning_rate: 2e-4
batch_size: 64
pretraining_epochs: 200
finetuning_epochs: 150
diffusion_steps_k: 4 (empirically set)
sequence_length_M: 100 tokens (approx 400 frames)

Compute: 4x NVIDIA GH200 GPUs. Inference latency is roughly 6-7 s per sentence.

Comparison to Prior Work

vs. SOKE: MaDiS uses masked diffusion (bidirectional) vs. autoregressive (unidirectional); supports parallel decoding vs. serial.
vs. MoMask++: MaDiS uses theoretically grounded MDLM masking/unmasking vs. heuristic masking schedules; includes text-to-sign specific pretraining.
vs. MDLM (Standard): MaDiS adds physical/latent space objectives and constrained 'temporal checkpoint' unmasking to accelerate convergence.

Limitations

Requires significantly more training epochs than ARLMs to converge without the UTC strategy
Reliance on an external VQ-VAE tokenizer means performance is upper-bounded by tokenizer reconstruction quality
Generating very long sequences requires truncation or sliding windows due to fixed context length

Reproducibility

Code: https://github.com/TencentYoutuResearch/MaDiS

Code and models will be released. Uses open-source Qwen3-0.6B and SMPL-X models. Datasets are standard public benchmarks (CSL-Daily, Phoenix-2014T, How2Sign).

📊 Experiments & Results

Evaluation Setup

Text-to-Sign Motion Generation evaluated on reconstruction quality and semantic alignment

Benchmarks:

CSL-Daily (Chinese Sign Language Generation)
Phoenix-2014T (German Sign Language Generation, weather domain)
How2Sign (American Sign Language Generation, open domain)

Metrics:

DTW-JPE (lower is better)
SiBLEU-4 (higher is better)
SiCLIP R@1 (higher is better)
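
To make the DTW-JPE metric concrete, here is a minimal sketch of dynamic time warping over per-frame joint position errors. Each frame is a list of 3D joint coordinates. The per-frame error (mean Euclidean distance over joints) and the lack of path-length normalization are assumptions; published implementations may normalize differently.

```python
import math


def frame_error(frame_a, frame_b):
    """Mean Euclidean distance between corresponding joints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)


def dtw_jpe(pred, ref):
    """Accumulated joint position error along the best monotonic time alignment."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_error(pred[i - 1], ref[j - 1])
            # Extend the cheapest of: advance pred only, advance ref only, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

The warping step is what makes the metric tolerant of signing speed: a generated sequence that holds a pose for one extra frame is not penalized for it.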

Key Results

Main comparison shows MaDiS outperforming state-of-the-art methods across all three datasets on the primary DTW-JPE metric.

Benchmark	Metric	Baseline	This Paper	Δ
CSL-Daily	DTW-JPE	7.96	7.52	-0.44
Phoenix-2014T	DTW-JPE	6.76	6.22	-0.54
How2Sign	DTW-JPE	7.33	6.79	-0.54

Ablation study demonstrates the cumulative value of adding latent and physical pretraining objectives.

Benchmark	Metric	Baseline	This Paper	Δ
CSL-Daily	DTW-JPE	7.88	7.52	-0.36

Inference latency comparison highlighting efficiency gains from parallel generation.

Benchmark	Metric	Baseline	This Paper	Δ
CSL-Daily	Inference Latency (s)	9.20	6.36	-2.84
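
The absolute deltas in the table can be turned into relative changes with a one-line formula. The short script below does this for the reported latency and DTW-JPE numbers; it uses only values from the table and is meant as a sanity check on the "~30%" latency claim.

```python
def relative_change(baseline, ours):
    """Signed percentage change from the baseline value to ours."""
    return (ours - baseline) / baseline * 100.0


# Values copied from the results table (lower is better for both metrics).
latency_change = relative_change(9.20, 6.36)   # CSL-Daily inference latency (s)
phoenix_change = relative_change(6.76, 6.22)   # Phoenix-2014T DTW-JPE

if __name__ == "__main__":
    print(f"latency: {latency_change:.1f}%")
    print(f"Phoenix-2014T DTW-JPE: {phoenix_change:.1f}%")
```

The latency change comes out at roughly -31%, in line with the stated ~30% reduction.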

Experiment Figures

Training loss and validation performance curves comparing UTC strategy vs. vanilla unmasking.

Visualization of learned gating weights in the Mixture-of-Parts layer.

Main Takeaways

MaDiS consistently outperforms autoregressive baselines (SOKE) in motion quality (DTW-JPE) and semantic alignment (SiCLIP), validating the MDLM approach.
Tri-level pretraining is crucial: predicting 3D physical motions directly helps the model more than predicting tokens or latent features alone.
The UTC (temporal checkpoints) strategy accelerates convergence by shrinking the space of unmasking orders by a factor of more than 10^41, making training feasible.
Parallel decoding in MDLMs reduces inference latency by ~30% compared to serial autoregressive decoding.

📚 Prerequisite Knowledge

Prerequisites

Masked Diffusion Models (MDM)
Vector Quantized Variational Autoencoders (VQ-VAE)
Autoregressive Language Models (ARLM)

Key Terms

MDLM: Masked Diffusion Language Model—a generative model that learns to predict masked tokens in a sequence, enabling bidirectional context and parallel generation

UTC: Unmasking with Temporal Checkpoints—a strategy that enforces specific unmasking ratios at fixed diffusion steps (e.g., 75% masked at t=0.75) to prune the generation search space

MoP: Mixture-of-Parts—an embedding layer that dynamically fuses information from different body parts (hands, body) using learnable gates

DTW-JPE: Dynamic Time Warping over Joint Position Errors—a metric measuring the distance between generated and ground-truth motion sequences, aligned in time

SiBLEU: Sign BLEU—a proposed metric evaluating the overlap of quantized sign tokens between generated and ground-truth sequences

SiCLIP: Sign CLIP—a proposed retrieval-based metric measuring semantic alignment between generated sign motions and input text in a joint embedding space

SMPL-X: A parametric 3D body model that includes body, face, and hand parameters

VQ-VAE: Vector Quantized Variational Autoencoder—a model that compresses continuous data (like motions) into discrete tokens from a codebook