MMDiT: Multimodal Diffusion Transformer—a neural network architecture that processes multiple modalities (like audio and video) using separate branches that communicate via attention mechanisms
RLHF: Reinforcement Learning from Human Feedback—a training method where a model is fine-tuned to maximize a reward signal derived from human preferences
SFT: Supervised Fine-Tuning—training the model on high-quality labeled datasets to establish baseline capabilities before RLHF
NFE: Number of Function Evaluations—the number of times the model's neural network is called during the generation process; fewer evaluations mean faster generation
GSB: Good-Same-Bad—a comparative evaluation metric where raters decide if one model's output is better, the same, or worse than another's
T2VA: Text-to-Video-Audio—generating both video and audio from a text description
I2VA: Image-to-Video-Audio—generating video and audio starting from a reference image
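Dolly Zoom: A cinematic effect in which the camera physically moves toward or away from the subject while the lens zooms in the opposite direction, so the subject stays roughly the same size in frame while the background perspective appears to stretch or compress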
Distillation: A compression technique where a smaller or faster model (student) learns to mimic the behavior of a larger or more complex model (teacher)
Orchid hand gesture: A stylized hand gesture from traditional Chinese opera, cited here to demonstrate the model's ability to generate culturally specific detail
Nianbai: A form of spoken dialogue in Chinese opera, distinct from singing, used to test the model's handling of specific vocal styles