MDLM: Masked Diffusion Language Model—a generative model that learns to predict masked tokens in a sequence, enabling bidirectional context and parallel generation
UTC: Unmasking with Temporal Checkpoints—a strategy that enforces specific unmasking ratios at fixed diffusion steps (e.g., 75% masked at t=0.75) to prune the generation search space
MoP: Mixture-of-Parts—an embedding layer that dynamically fuses information from different body parts (hands, body) using learnable gates
DTW-JPE: Dynamic Time Warping over Joint Position Errors—a metric that time-aligns generated and ground-truth motion sequences with dynamic time warping, then measures the joint position error that remains along the alignment
SiBLEU: Sign BLEU—a proposed metric evaluating the overlap of quantized sign tokens between generated and ground-truth sequences
SiCLIP: Sign CLIP—a proposed retrieval-based metric measuring semantic alignment between generated sign motions and input text in a joint embedding space
SMPL-X: SMPL eXpressive, a parametric 3D body model that jointly represents body, face, and hand parameters
VQ-VAE: Vector Quantized Variational Autoencoder—a model that compresses continuous data (like motions) into discrete tokens from a codebook
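
To make the DTW-JPE entry above concrete, here is a minimal sketch of how such a metric could be computed. It assumes each motion is a list of frames and each frame is a list of 3D joint coordinates. The function names `frame_error` and `dtw_jpe`, the mean per-joint Euclidean distance used as the frame cost, and the choice to return the unnormalized accumulated cost are all illustrative assumptions; published implementations may normalize by path length or use a different per-frame error.

```python
import math


def frame_error(frame_a, frame_b):
    """Mean Euclidean distance between corresponding joints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)


def dtw_jpe(generated, reference):
    """Accumulated joint position error along the best DTW alignment.

    Assumption: the cost is left unnormalized; dividing by the path length
    or the reference length is a common variant.
    """
    n, m = len(generated), len(reference)
    inf = float("inf")
    # cost[i][j] = minimal accumulated error aligning the first i generated
    # frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_error(generated[i - 1], reference[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # generated frame repeats a reference frame
                cost[i][j - 1],      # reference frame repeats a generated frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Two single-joint motions: the second holds the first pose one frame longer.
ref = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
slow = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
print(dtw_jpe(slow, ref))
```

Because the alignment may repeat frames, a motion that differs from the reference only in timing incurs no error in this sketch, which is the property that separates a time-warped metric from a frame-by-frame comparison.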