Soup-of-Tasks: A paradigm unifying different animation tasks (T2V, I2V, lip-sync) into a single model by treating them as variations of masked spatial-temporal reconstruction.
Soup-of-Modals: A mechanism for handling multiple input modalities (audio, text, image): all modalities share a single query, each keeps its own keys and values, and the modalities are then mixed with weights that depend on the diffusion timestep.
Negative DPO: Negative Direct Preference Optimization, a training strategy that uses unpaired negative samples (bad outputs) to penalize the model's tendency toward undesirable distributions, without requiring matched positive samples.
CDCA: Coupled-Decoupled Multi-Modal Cross Attention—a module that shares queries across modalities but keeps keys/values specific, allowing precise multi-modal injection.
Multi-Modal PhDA: Multi-Modal Timestep Phase-aware Dynamic Allocation—a mechanism that adjusts the influence of different modalities (audio/text/image) depending on the diffusion noise level (timestep).
PNG: Phase-aware Negative classifier-free Guidance—an inference technique that applies weighted negative prompts at specific diffusion timesteps to suppress artifacts like unnatural gestures.
LVDM: Large-scale Video Diffusion Model, a class of high-parameter models that typically produce high-quality video but are slow to run.
CVDM: Compact Video Diffusion Model, a class of smaller, faster models that typically sacrifice quality for speed.
EMA: Exponential Moving Average—a technique used here to gradually integrate weights from simpler tasks into the main model to prevent catastrophic forgetting.
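FID: Fréchet Inception Distance, a metric for generated image quality that compares the distribution of Inception-network features of generated images with that of real images; lower is better.
FVD: Fréchet Video Distance, the video analogue of FID, which compares feature distributions of generated and real videos to assess quality and temporal coherence; lower is better.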
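
The CDCA and Multi-Modal PhDA entries above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`attend`, `phase_weights`, `cdca`), the linear blending schedule, the convention that `t = num_steps - 1` is the noisiest step, and the example weights are all assumptions chosen for illustration. It shows only the general idea of one shared query attending to modality-specific keys/values, with the branches mixed by timestep-dependent weights.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attend(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)

def phase_weights(t, num_steps, early, late):
    """Hypothetical schedule: blend per-modality weights linearly from
    `early` (t = num_steps - 1, assumed noisiest) to `late` (t = 0),
    then normalize so the weights sum to 1."""
    a = t / (num_steps - 1)
    raw = {m: a * early[m] + (1 - a) * late[m] for m in early}
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}

def cdca(x, w_q, kv, weights):
    """Coupled query, decoupled keys/values.
    x: latent tokens (n x d); w_q: shared query projection (d x d);
    kv: {modality: (keys, values)}; weights: {modality: scalar}."""
    q = matmul(x, w_q)  # one query shared by every modality
    out = None
    for name, (k, v) in kv.items():
        branch = attend(q, k, v)  # modality-specific keys and values
        scaled = [[weights[name] * val for val in row] for row in branch]
        if out is None:
            out = scaled
        else:
            out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, scaled)]
    return out

# Toy inputs: two latent tokens of width 2 and two modalities.
x = [[1.0, 0.0], [0.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability
kv = {
    "text": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]),
    "audio": ([[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]),
}
early = {"text": 3.0, "audio": 1.0}  # arbitrary illustrative weights
late = {"text": 1.0, "audio": 3.0}   # arbitrary illustrative weights
w = phase_weights(t=49, num_steps=50, early=early, late=late)
out = cdca(x, w_q, kv, w)
```

In this toy setup each modality's values are constant, so each output token simply reflects the mixing weights at the chosen timestep. A real model would use learned projections, multiple heads, and whatever schedule the method actually specifies.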