GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes a policy by comparing outcomes within a group of samples rather than using a learned value function critic
KV cache: Key-Value cache, the key and value tensors that a Transformer's attention layers compute for past tokens, stored so they are not recomputed at every generation step
Top-K ranking: A selection strategy that keeps only the K items with the highest assigned scores
DiT: Diffusion Transformer, a diffusion model architecture that replaces the U-Net with a Transformer backbone; used here for video generation
Autoregressive: Generating data sequentially, where each new piece depends on previously generated pieces
Plackett-Luce model: A probabilistic model for ranking items, used here to sample an ordered list of context tokens based on predicted relevance scores
CsVBench: Cross-scene Video Benchmark—a new benchmark proposed in this paper containing multi-scene prompts with shared subjects
ArcFace: A face recognition model used here to compute identity consistency rewards
CLIP: Contrastive Language-Image Pre-training, a model that embeds images and text in a shared space; used here to measure how well generated images match text prompts
VLM: Vision-Language Model—used here as an artifact detector to penalize low-quality generations
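
Three of the terms above (GRPO, Top-K ranking, and the Plackett-Luce model) describe small computations that may be easier to follow in code. The following is a minimal Python sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the GRPO part shows only the commonly used group-wise reward standardization (the paper's exact objective is not given here), and the Plackett-Luce sampler assumes items are drawn without replacement with probability proportional to the exponential of their score.

```python
import math
import random


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples.

    Assumption: advantage = (reward - group mean) / group std, which is the
    usual GRPO-style replacement for a learned value-function critic.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def plackett_luce_sample(scores, k, rng):
    """Sample an ordered list of k distinct indices.

    At each step one of the remaining items is picked with probability
    proportional to exp(score), then removed from the pool.
    """
    remaining = list(range(len(scores)))
    ranking = []
    for _ in range(min(k, len(remaining))):
        # Subtract the max score before exponentiating for numerical stability.
        m = max(scores[i] for i in remaining)
        weights = [math.exp(scores[i] - m) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking
```

In this sketch, `top_k` is the deterministic counterpart of `plackett_luce_sample`: the former always returns the highest-scoring items, while the latter returns a random ordering that favors them, which is the kind of stochasticity a policy-gradient method such as GRPO needs in order to compare several sampled outcomes.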