Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

📝 Paper Summary

Multi-modal Large Language Models (MLLMs), Video Understanding Benchmarks

Video-MME is a comprehensive benchmark for evaluating multi-modal large language models on video analysis tasks across diverse domains, varying durations (11s to 1 hour), and multiple modalities (video, audio, subtitles).

Core Problem

Existing video benchmarks for Multi-modal Large Language Models (MLLMs) lack diversity in video types, fail to cover varying temporal durations (especially long videos), and often ignore audio/subtitle modalities.

Why it matters:

Current evaluations focus heavily on static images or short clips, failing to test MLLMs on real-world long-form sequential data
Ignoring audio and subtitles limits the assessment of a model's true multimodal understanding capabilities
The lack of high-quality, diverse annotations for long videos hinders the development of models capable of complex temporal reasoning

Concrete Example: A question asking for the 'departure date' in a travel vlog might require reading a text overlay ('May 31') and listening to audio narration ('Day 1') simultaneously. Existing benchmarks might only provide the visual frames, making the question unanswerable, or use short clips where such context is cut off.

Key Novelty

Video-MME (Multi-Modal Evaluation benchmark)

Constructs a dataset of 900 videos spanning varied durations (short, medium, long) and 6 diverse domains to test generalization
Integrates multi-modal inputs explicitly: evaluates performance with and without subtitles/audio to measure their contribution
Uses 'certificate length' analysis to ensure questions require digesting significant portions of the video, preventing shortcuts
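
As a rough illustration of the certificate length idea, the sketch below sums the time covered by a set of annotated sub-clips, counting overlapping spans once. It assumes an annotator has already chosen the minimal set of (start, end) intervals in seconds that justify an answer; the function name and interval format are hypothetical, and this summary does not describe the paper's own procedure.

```python
from typing import List, Tuple


def certificate_length(clips: List[Tuple[float, float]]) -> float:
    """Return the total seconds covered by the union of (start, end) sub-clips."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(clips):
        if end < start:
            raise ValueError(f"clip ends before it starts: {(start, end)}")
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it out and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Two overlapping clips (10-40 s once merged) plus one later clip (300-330 s).
print(certificate_length([(10.0, 25.0), (20.0, 40.0), (300.0, 330.0)]))  # 60.0
```

Choosing the minimal set of clips remains a human judgment; the code only measures the clips it is given.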

Architecture

The data construction statistics and hierarchy, illustrating the domain distribution and duration breakdown.

Evaluation Highlights

Gemini 1.5 Pro achieves 81.3% accuracy with subtitles, outperforming GPT-4o (77.2%) by 4.1 points and leading all evaluated open-source models
Integrating subtitles and audio raises Gemini 1.5 Pro's accuracy by 6.3 and 4.3 percentage points respectively (from 75.0% to 81.3% and 79.3%), with larger gains in longer videos
Performance drops as video length increases: Gemini 1.5 Pro drops from 81.7% on short videos to 67.4% on long videos (without subtitles)

Breakthrough Assessment

9/10

Sets a new standard for video MLLM evaluation by addressing the critical gap in long-context and multi-modal (audio/subtitle) assessment. The rigorous manual annotation and 'certificate length' validation make it highly robust.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering (VideoQA) with multiple-choice questions

Inputs: Video sequence V (frames), Audio track A, Subtitles S (text), Question Q, Candidate Options O

Outputs: Predicted option index (A, B, C, or D)
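
The setting above can be sketched as a small record type plus helpers that build a text prompt and read back the chosen letter. Everything here is illustrative: the field names, the prompt wording, and the parsing rule are assumptions, since this summary does not give the exact prompt templates used in the paper.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

LETTERS = "ABCD"


@dataclass
class VideoQAItem:
    video_path: str                   # V: frames are decoded from this file
    question: str                     # Q
    options: List[str]                # O: four candidate answers
    answer: str                       # ground-truth letter, "A" to "D"
    subtitles: Optional[str] = None   # S: subtitle text, if available
    audio_path: Optional[str] = None  # A: audio track, if available


def build_prompt(item: VideoQAItem, use_subtitles: bool) -> str:
    """Assemble the text part of the query (hypothetical template)."""
    lines = []
    if use_subtitles and item.subtitles:
        lines.append("Subtitles:\n" + item.subtitles)
    lines.append("Question: " + item.question)
    for letter, option in zip(LETTERS, item.options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def parse_choice(reply: str) -> Optional[str]:
    """Return the first standalone letter A-D in the reply, or None."""
    match = re.search(r"\b([ABCD])\b", reply)
    return match.group(1) if match else None
```

Toggling `use_subtitles` gives the with-subtitles and without-subtitles conditions from the same item. The parser is deliberately naive: a reply that opens with the English article "A" would be read as option A, so a stricter output format would be needed in practice.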

Pipeline Flow

Data Collection: Domain Hierarchy → Video Sourcing
Annotation: Manual QA Creation → Review & Filtering
Evaluation: Model Inference → Accuracy Calculation

System Modules

Video Collector

Source videos across 6 domains and 30 subfields from YouTube, ensuring duration diversity (Short/Medium/Long)

Model or implementation: N/A (Manual/Scripted Collection)

Annotator (Annotation)

Create 3 multiple-choice questions per video based on full content viewing

Model or implementation: Human Experts

Quality Reviewer (Annotation)

Filter out low-quality questions and those that Gemini 1.5 Pro can answer from the question text alone, without the video

Model or implementation: Human + Gemini 1.5 Pro
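
A minimal sketch of the text-only filtering step follows. It assumes each question is a dict with `question`, `options`, and `answer` keys, and that `answer_without_video` is a caller-supplied function wrapping a model queried with no video input. Both are hypothetical stand-ins; the real review also involves human judgment, which is not modeled here.

```python
from typing import Callable, Dict, List


def filter_text_only_solvable(
    questions: List[Dict],
    answer_without_video: Callable[[str, List[str]], str],
) -> List[Dict]:
    """Keep only questions the text-only answerer gets wrong."""
    kept = []
    for q in questions:
        guess = answer_without_video(q["question"], q["options"])
        if guess != q["answer"]:
            kept.append(q)
    return kept


def always_a(question: str, options: List[str]) -> str:
    # Stub answerer for demonstration: always picks the first option.
    return "A"


sample = [
    {"question": "Q1", "options": ["w", "x", "y", "z"], "answer": "A"},
    {"question": "Q2", "options": ["w", "x", "y", "z"], "answer": "C"},
]
print([q["question"] for q in filter_text_only_solvable(sample, always_a)])  # ['Q2']
```

With four options, even a blind guesser is right a quarter of the time, so a single pass like this would also discard some questions that do depend on the video.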

Novel Architectural Elements

Integration of Subtitle and Audio modalities as first-class inputs for evaluation
Hierarchical domain taxonomy (6 domains, 30 subfields) for balanced scenario coverage
Explicit categorization by video duration (Short, Medium, Long) to test temporal context adaptability

Modeling

Base Model: Evaluated models include Gemini 1.5 Pro, GPT-4o, GPT-4V, LLaVA-NeXT-Video, InternVL-Chat-V1.5

Comparison to Prior Work

vs. MVBench: Video-MME covers long videos (up to 1h) and multi-modal inputs (audio/subs), whereas MVBench focuses on short clips.
vs. EgoSchema: Video-MME includes diverse domains (movies, sports, etc.) and audio/subs, while EgoSchema is strictly ego-centric and visual-only.
vs. MMBench-Video [not cited in paper]: Video-MME explicitly targets the 'long context' and 'multi-modal integration' gap, whereas generic benchmarks may not separate these factors.

Limitations

Evaluation relies heavily on commercial APIs (Gemini/GPT-4), which may change over time.
Performance on long videos is still relatively low for open-source models, suggesting high compute/memory barriers.
Text-only filtering using Gemini 1.5 Pro might bias the dataset against questions that Gemini specifically finds easy in text-only mode.

Reproducibility

Code: https://video-mme.github.io

Publicly available: Dataset (Video-MME) and leaderboard at https://video-mme.github.io

Missing: The exact prompt templates used for the baseline evaluations are not detailed in the main text, although code is referenced.

Closed-source dependencies: API access is required for the Gemini and GPT-4 evaluations.

📊 Experiments & Results

Evaluation Setup

Zero-shot Multiple Choice QA on Video-MME dataset

Benchmarks:

Video-MME (Video Question Answering) [New]

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper
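
Accuracy here is the share of questions whose predicted letter matches the ground truth, reported overall and per duration category. The sketch below assumes each record is a dict with `prediction`, `answer`, and `duration` keys; this schema is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_group(records: List[Dict], key: str = "duration") -> Dict[str, float]:
    """Return accuracy in percent for each group value plus an 'overall' entry."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        hit = int(record["prediction"] == record["answer"])
        # Count every record once in its own group and once in the overall pool.
        for group in (record[key], "overall"):
            total[group] += 1
            correct[group] += hit
    return {group: 100.0 * correct[group] / total[group] for group in total}


records = [
    {"duration": "short", "prediction": "A", "answer": "A"},
    {"duration": "short", "prediction": "C", "answer": "C"},
    {"duration": "long", "prediction": "B", "answer": "B"},
    {"duration": "long", "prediction": "D", "answer": "A"},
]
print(accuracy_by_group(records))
# {'short': 100.0, 'overall': 75.0, 'long': 50.0}
```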

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Commercial models significantly outperform open-source models, with Gemini 1.5 Pro leading, especially when subtitles are available.
Video-MME (Overall)	Accuracy (%)	77.2	81.3	+4.1
Video-MME (Overall)	Accuracy (%)	59.4	81.3	+21.9
Performance degrades as video length increases, but subtitles help mitigate this drop.
Video-MME (Long Videos)	Accuracy (%)	84.5	77.4	-7.1
Video-MME (Overall)	Accuracy (%)	75.0	81.3	+6.3
Video-MME (Overall)	Accuracy (%)	75.0	79.3	+4.3
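
The Δ column is a difference in percentage points, not a relative change. The short check below recomputes it from the baseline and reported values in the table above, and also shows the relative change for comparison; the helper names are illustrative.

```python
# (baseline, reported) accuracy pairs, copied in order from the table above.
pairs = [(77.2, 81.3), (59.4, 81.3), (84.5, 77.4), (75.0, 81.3), (75.0, 79.3)]


def point_delta(baseline: float, reported: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(reported - baseline, 1)


def relative_change(baseline: float, reported: float) -> float:
    """Change relative to the baseline, in percent, rounded to one decimal."""
    return round(100.0 * (reported - baseline) / baseline, 1)


print([point_delta(b, r) for b, r in pairs])  # [4.1, 21.9, -7.1, 6.3, 4.3]
```

For instance, moving from 75.0 to 81.3 is a gain of 6.3 points, which corresponds to a larger relative improvement over the 75.0 baseline.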

Experiment Figures

Examples of Video-MME QA pairs demonstrating the need for multi-modal reasoning.

Main Takeaways

Gemini 1.5 Pro is the strongest model evaluated on the benchmark, making effective use of its long context window.
Subtitles and audio carry useful signal: models that can process them gain up to about 6 percentage points in accuracy.
Long-context video understanding remains a challenge; all models show a performance decline as video duration increases from short to long.
Image-based MLLMs (like GPT-4V) can perform competitively with video-specific models when fed multi-frame inputs, suggesting strong image understanding is foundational.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Multi-modal Large Language Models (MLLMs) like GPT-4V and Gemini
Basic understanding of Video QA tasks and metrics

Key Terms

MLLM: Multi-modal Large Language Model—AI models capable of processing and generating text based on multiple input modalities like images, video, and audio

Video-MME: Video Multi-Modal Evaluation—the specific benchmark proposed in this paper

Certificate Length: The minimum total duration of video sub-clips required to verify that an answer to a question is correct; used as a metric for temporal difficulty

Gemini 1.5 Pro: A commercial MLLM from Google known for its large context window capabilities

GPT-4o: A commercial multimodal model from OpenAI

InternVL-Chat-V1.5: An open-source MLLM designed for image and video understanding

LLaVA-NeXT-Video: An open-source MLLM specifically optimized for video tasks

QA: Question Answering

Visual Domains: Categories of video content, such as Knowledge, Sports, or Film

Temporal Dynamics: Changes and interactions occurring over time within a video sequence