MLLM: Multi-modal Large Language Model, an AI model that takes multiple input modalities (such as images, video, and audio) and generates text
Video-MME: Video Multi-Modal Evaluation, the benchmark proposed in this paper
Certificate Length: The minimum total duration of video sub-clips required to verify that an answer to a question is correct; used as a metric for temporal difficulty
Gemini 1.5 Pro: A commercial MLLM from Google known for its long context window
GPT-4o: A commercial multimodal model from OpenAI
InternVL-Chat-V1.5: An open-source MLLM designed for image and video understanding
LLaVA-NeXT-Video: An open-source MLLM specifically optimized for video tasks
QA: Question Answering
Visual Domains: Categories of video content, such as Knowledge, Sports, or Film
Temporal Dynamics: Changes and interactions occurring over time within a video sequence