MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

📝 Paper Summary

Multi-modal Large Language Models (MLLMs)
Video Understanding Benchmarks

MVBench evaluates MLLMs on 20 temporal video tasks defined via a static-to-dynamic transformation method, revealing significant gaps in current models that VideoChat2 aims to close.

Core Problem

Existing MLLM benchmarks primarily focus on spatial understanding in static images, failing to assess temporal evolution and procedural activities crucial for video understanding.

Why it matters:

Current benchmarks (e.g., MMBench) rely on static image QA, missing dynamic context like movement direction or action sequences
Prior video benchmarks (e.g., VideoChatGPT) are limited to basic tasks or specific domains, lacking comprehensive temporal skill assessment
Heavy reliance on manual annotation makes scaling video benchmarks expensive and slow

Concrete Example: A static image task asks 'Is the man on the stage?', which only requires spatial perception. The corresponding video task asks 'What direction is the man moving?', requiring reasoning about temporal changes over multiple frames.

Key Novelty

Static-to-Dynamic Task Definition & VideoChat2 Baseline

Systematically defines 20 temporal video tasks by converting 9 static image tasks into dynamic versions (e.g., 'Position' becomes 'Moving Direction')
Automated QA generation pipeline converts 11 public video datasets into multiple-choice questions using LLMs, ensuring ground-truth accuracy without manual labeling
Introduces VideoChat2, a strong baseline trained progressively with diverse instruction data (2M samples) to bridge the temporal understanding gap
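The conversion of an annotated video sample into a multiple-choice item can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the function name `build_multiple_choice`, the item layout, and the hand-written distractors are all assumptions (in the paper, an LLM helps produce the question and candidate options from existing dataset annotations).

```python
import random
from string import ascii_uppercase


def build_multiple_choice(question, answer, distractors, seed=0):
    """Turn one annotated (question, answer) pair into a multiple-choice item.

    Hypothetical helper: the ground-truth answer comes from the dataset
    annotation, so no manual labeling is needed to know the correct option.
    """
    options = [answer] + list(distractors)
    rng = random.Random(seed)          # fixed seed keeps the item reproducible
    rng.shuffle(options)               # avoid always placing the answer first
    letters = ascii_uppercase[:len(options)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],
    }


item = build_multiple_choice(
    "What direction is the man moving?",
    answer="left",
    distractors=["right", "up", "down"],   # hypothetical distractors
)
print(item["question"], item["options"], "->", item["answer"])
```

Shuffling matters because a fixed answer position would let a model score well without looking at the video.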

Architecture

The progressive training pipeline of VideoChat2, detailing three stages of alignment and tuning.

Evaluation Highlights

VideoChat2 achieves 51.1% average accuracy on MVBench, surpassing the previous best open-source model (VideoChat) by more than 15 percentage points
VideoChat2 outperforms GPT-4V (43.5%) by 7.6 percentage points on MVBench average accuracy
On the ActivityNet zero-shot QA benchmark, VideoChat2 achieves 49.1% accuracy, surpassing VideoChatGPT (35.2%) by about 14 percentage points

Breakthrough Assessment

8/10

Significantly advances video MLLM evaluation by focusing specifically on temporal tasks often ignored by image-based benchmarks. The proposed VideoChat2 model sets a strong new state of the art on MVBench among the evaluated models (51.1% average accuracy).

⚙️ Technical Details

Problem Definition

Setting: Zero-shot video question answering (multiple-choice) to evaluate temporal understanding capabilities

Inputs: Video clip V and a multiple-choice question Q

Outputs: The correct option (A, B, C, or D) corresponding to the video content
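Under this setting, scoring reduces to comparing predicted option letters with the ground-truth letters. The sketch below is an assumed scoring helper (the name `multiple_choice_accuracy` and the toy predictions are illustrative, not taken from the paper's code).

```python
def multiple_choice_accuracy(predictions, ground_truth):
    """Return accuracy in percent for lists of option letters."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(
        p.strip().upper() == g.strip().upper()   # compare letters case-insensitively
        for p, g in zip(predictions, ground_truth)
    )
    return 100.0 * correct / len(ground_truth)


# Toy example: three of four predictions match the ground truth.
acc = multiple_choice_accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"])
print(f"{acc:.1f}%")
```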

Pipeline Flow

Visual Encoder (extracts features from video frames)
QFormer (compresses visual features into query tokens)
Linear Projection (maps query tokens to LLM dimension)
LLM (generates text response based on visual and text inputs)
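To make the role of each stage concrete, the sketch below tracks how many visual tokens flow through the pipeline. Only the frame count (8) and resolution (224) come from this summary; the patch size, the number of QFormer queries, and the LLM width are assumptions chosen for illustration, and class tokens are ignored.

```python
from dataclasses import dataclass


@dataclass
class PipelineShapes:
    num_frames: int = 8      # Stage 3 frame count reported in the summary
    resolution: int = 224    # input resolution reported in the summary
    patch_size: int = 16     # ASSUMPTION: typical ViT patch size
    num_queries: int = 32    # ASSUMPTION: hypothetical number of QFormer queries
    llm_dim: int = 4096      # ASSUMPTION: hidden width of a 7B LLaMA-style LLM

    def encoder_tokens(self) -> int:
        """Patch tokens produced by the visual encoder over all frames."""
        per_side = self.resolution // self.patch_size
        return self.num_frames * per_side * per_side

    def llm_visual_tokens(self) -> int:
        """Tokens handed to the LLM after QFormer compression and projection."""
        return self.num_queries

    def compression_ratio(self) -> float:
        return self.encoder_tokens() / self.llm_visual_tokens()


shapes = PipelineShapes()
print(shapes.encoder_tokens(), "encoder tokens ->",
      shapes.llm_visual_tokens(), "LLM tokens of width", shapes.llm_dim)
```

Under these assumed values the QFormer reduces the visual sequence by a large factor before the linear projection maps each query token to the LLM dimension, which is the reason the LLM context is not dominated by video tokens.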

System Modules

Visual Encoder

Extract spatial-temporal features from input video frames

Model or implementation: UMT-L (Unmasked Teacher, Large)

QFormer

Align visual features with text and compress redundant visual information

Model or implementation: BERT-base initialized QFormer

Large Language Model

Generate text responses/answers

Model or implementation: Vicuna-7B (v0 or v1.5)

Novel Architectural Elements

Progressive multi-modal training pipeline that unfreezes the visual encoder (UMT-L) in later stages for better temporal adaptation
Incorporation of instruction text into QFormer input (but not the question text) to extract instruction-relevant visual tokens

Modeling

Base Model: Vicuna-7B connected to UMT-L visual encoder

Training Method: Progressive Multi-Stage Training (Alignment -> Connection -> Instruction Tuning)

Objective Functions:

Vision-Text Contrastive learning (VTC): align visual queries with text.
Vision-Text Matching (VTM): determine if image and text match.
Vision-grounded Text Generation (VTG): generate text grounded in vision.

Adaptation: LoRA (rank=16, alpha=32, dropout=0.1) on LLM; Full fine-tuning of QFormer and Visual Encoder in later stages
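The LoRA settings above imply a fixed scaling factor and a small trainable-parameter budget per adapted matrix. The arithmetic below is a sketch under stated assumptions: the summary does not say which LLM layers are adapted, so a single square projection with hidden size 4096 (typical of a 7B LLaMA-style model) is used as a hypothetical example.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 32          # values reported in the summary
HIDDEN = 4096                 # ASSUMPTION: hidden size of the adapted projection

scaling = lora_scaling(RANK, ALPHA)
full_params = HIDDEN * HIDDEN
lora_params = lora_trainable_params(HIDDEN, HIDDEN, RANK)
fraction = lora_params / full_params

print(f"scaling={scaling}, LoRA params={lora_params}, "
      f"fraction of full matrix={fraction:.4%}")
```

For this assumed shape, the adapter trains under 1% of the parameters of the frozen weight matrix it modifies.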

Training Data:

Stage 1: 15M image captions (CC3M, CC12M) + 10M video captions (WebVid-10M)
Stage 2: Adds 2M image captions (COCO, VG, SBU) + 10M video captions (InternVid)
Stage 3: 2M instruction samples from 34 datasets (Conversation, Caption, VQA, Reasoning, Classification)

Key Hyperparameters:

learning_rate: Stage 1/2: 1e-4, Stage 3: 2e-5
batch_size: Stage 1: 2048, Stage 2: 512, Stage 3: 128
epochs: Stage 1: 10, Stage 2: 1, Stage 3: 3
input_resolution: 224x224
input_frames: Stage 1/2: 4 frames, Stage 3: 8 frames

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. VideoChat: VideoChat2 unfreezes the visual encoder (UMT-L) and uses significantly more diverse instruction data (2M vs 17K)
vs. VideoChatGPT: VideoChat2 uses a more powerful visual encoder and QFormer bridge rather than simple pooling, achieving higher accuracy with fewer frames (16 vs 100)
vs. SeViLA [not cited in paper]: SeViLA uses keyframe selection; VideoChat2 processes uniformly sampled frames via UMT-L temporal modeling

Limitations

Performance drops on counting tasks (Action Count, Moving Count) and character recognition, likely due to lack of specific training data
Benchmark generation relies on ChatGPT, which may introduce biases or errors in question formation
Evaluation requires specific prompt engineering ('Best Option: (') to robustly extract answers from MLLMs
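The answer-prompt trick in the last point can be illustrated as follows. Only the 'Best Option: (' prefix comes from the paper; the prompt layout, the helper names `build_prompt` and `extract_option`, and the parsing rule are assumptions for illustration.

```python
import re

ANSWER_PREFIX = "Best Option: ("   # prefix reported in the summary


def build_prompt(question, options):
    """Assemble a prompt that ends right before the option letter."""
    lines = [f"Question: {question}", "Options:"]
    lines += [f"({letter}) {text}" for letter, text in options.items()]
    lines.append(ANSWER_PREFIX)
    return "\n".join(lines)


def extract_option(continuation, valid_letters="ABCD"):
    """Read the option letter from the model's continuation, or None.

    Accepts 'B) left', '(c)' or a bare 'B'; rejects free text such as
    'A man walks', where the first letter is not followed by ')'.
    """
    pattern = rf"\(?([{valid_letters}])\s*(?:\)|$)"
    match = re.match(pattern, continuation.strip(), flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


prompt = build_prompt(
    "What direction is the man moving?",
    {"A": "right", "B": "left", "C": "up", "D": "down"},
)
print(prompt)
print(extract_option("B) left"))
```

Ending the prompt with an open parenthesis nudges the model to emit a letter first, which keeps the parser simple and reduces unparseable answers.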

Reproducibility

Code: https://github.com/OpenGVLab/Ask-Anything

Code, models, and data publicly available at https://github.com/OpenGVLab/Ask-Anything. The benchmark generation process relies on ChatGPT, and the instruction tuning data is a compilation of open-source datasets.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on 20 distinct temporal tasks defined in MVBench

Benchmarks:

MVBench (multi-modal video understanding, 20 sub-tasks) [New]
Video Conversation Benchmark (Open-ended conversation evaluation)
Zero-shot Video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA)

Metrics:

Accuracy (%)
Score (1-5 scale for conversation quality)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results on MVBench showing VideoChat2's dominance over existing open-source MLLMs and GPT-4V.
Benchmark	Metric	Baseline	This Paper	Δ (points)
MVBench	Average Accuracy (%)	35.5	51.1	+15.6
MVBench	Average Accuracy (%)	43.5	51.1	+7.6
MVBench	Average Accuracy (%)	32.7	51.1	+18.4
Zero-shot QA performance on standard video benchmarks confirms generalization capability.
Benchmark	Metric	Baseline	This Paper	Δ (points)
ActivityNet-QA	Accuracy (%)	35.2	49.1	+13.9
MSRVTT-QA	Accuracy (%)	49.3	54.1	+4.8
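The Δ column is an absolute difference in percentage points, which is not the same as a relative improvement. The short check below recomputes both from the reported numbers; the helper names are illustrative.

```python
# (benchmark, baseline accuracy %, VideoChat2 accuracy %) as listed in the table
results = [
    ("MVBench", 35.5, 51.1),
    ("MVBench", 43.5, 51.1),
    ("MVBench", 32.7, 51.1),
    ("ActivityNet-QA", 35.2, 49.1),
    ("MSRVTT-QA", 49.3, 54.1),
]


def delta_points(baseline, ours):
    """Absolute gain in percentage points."""
    return round(ours - baseline, 1)


def relative_gain(baseline, ours):
    """Gain relative to the baseline, in percent."""
    return round(100.0 * (ours - baseline) / baseline, 1)


deltas = [delta_points(b, o) for _, b, o in results]
for (name, b, o), d in zip(results, deltas):
    print(f"{name}: +{d} points, {relative_gain(b, o)}% relative")
```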

Experiment Figures

The concept of static-to-dynamic task definition, mapping image tasks (Spatial) to video tasks (Temporal).

Main Takeaways

Existing MLLMs struggle significantly with temporal tasks, often performing close to random guess or text-only baselines
Unfreezing the visual encoder and using diverse video-centric instruction data are critical for temporal understanding
VideoChat2 demonstrates that a specialized 7B model can outperform GPT-4V on specific temporal video understanding tasks
Instruction data ablation shows that video data contributes more to performance gains (42.1% -> 50.5%) than simply adding more image data
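The comparison with random guessing in the first takeaway depends on how many options each question offers. A sketch of the chance-level computation follows; the option counts in the second example are hypothetical and not taken from MVBench.

```python
def random_guess_accuracy(option_counts):
    """Expected accuracy (%) of uniform guessing over a set of questions.

    option_counts holds the number of answer options for each question.
    """
    if not option_counts:
        raise ValueError("need at least one question")
    return 100.0 * sum(1.0 / n for n in option_counts) / len(option_counts)


# With four options per question, chance level is 25%.
uniform_four = random_guess_accuracy([4] * 10)

# HYPOTHETICAL mix of option counts: fewer options raise the chance level.
mixed = random_guess_accuracy([4, 4, 3, 2])

print(f"{uniform_four:.1f}% vs {mixed:.1f}%")
```

A model score should be read against this floor: an accuracy a few points above chance indicates little usable temporal understanding.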

📚 Prerequisite Knowledge

Prerequisites

Multi-modal Large Language Models (MLLM) architecture
Vision Transformers (ViT)
Instruction Tuning
Low-Rank Adaptation (LoRA)

Key Terms

MLLM: Multi-modal Large Language Model—an AI system capable of processing and generating text based on multiple input modalities like images and video

QFormer: Query Transformer—a module from BLIP-2 that bridges the gap between frozen visual encoders and frozen LLMs using learnable query tokens

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers

Static-to-Dynamic: The paper's method of defining video tasks by extending static image concepts (e.g., position) into temporal equivalents (e.g., moving direction)

TSN: Temporal Segment Network—a sampling strategy that divides a video into segments and samples frames to represent the entire duration
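Segment-based sampling of this kind can be sketched in a few lines. The version below picks the midpoint of each segment, which is one common deterministic variant; whether the paper uses midpoints or random offsets within each segment is not stated in this summary, so treat the choice as an assumption.

```python
def sample_frame_indices(total_frames, num_segments):
    """Split the video into equal segments and take each segment's midpoint.

    If total_frames < num_segments, some indices repeat.
    """
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    segment_length = total_frames / num_segments
    return [
        min(total_frames - 1, int(segment_length * (i + 0.5)))
        for i in range(num_segments)
    ]


# A 100-frame clip sampled with 8 segments (8 frames is the Stage 3 setting).
indices = sample_frame_indices(100, 8)
print(indices)
```

Because every segment contributes one frame, the sampled frames cover the whole clip regardless of its length, which suits tasks that depend on how a scene changes from start to end.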