CG-Bench: Clue-Grounded Question Answering Benchmark for Long Video Understanding

📝 Paper Summary

Long Video Understanding | Video Question Answering (VideoQA) | Multimodal Large Language Models (MLLMs)

CG-Bench is a long-video benchmark that evaluates whether MLLMs genuinely understand content by requiring them to identify specific video clues justifying their answers, preventing reliance on text-only shortcuts.

Core Problem

Existing long-video benchmarks rely on multiple-choice questions (MCQs) where models can often guess correct answers using text-based elimination or general knowledge without genuinely retrieving relevant visual evidence.

Why it matters:

Current models achieve high scores on MCQs via elimination strategies, creating a false sense of capability while lacking true video comprehension.
Long video understanding requires retrieving specific moments (clues) from hours of content, a capability not tested by standard QA accuracy metrics.
Trustworthy AI requires models to ground their reasoning in actual data evidence rather than hallucinating or guessing based on language biases.

Concrete Example: In a video question about why a character is angry, a model might eliminate option A ('The car is blue') because it contradicts the question text, and select option B without ever seeing the relevant scene. CG-Bench exposes this by asking the model to pinpoint the exact timestamps (clues) that support option B.

Key Novelty

Clue-Grounded Evaluation for Long Videos

Annotates 'clue intervals' (timestamped evidence) for every QA pair, allowing the benchmark to check if the model found the right part of the video.
Introduces 'White-box' evaluation (model must output timestamps) and 'Black-box' evaluation (compares accuracy on full video vs. short clue clip) to measure credibility.
Uses a heuristic 'clue-aided' open-ended evaluation where a judge model (GPT-4o) uses the ground-truth video clues to verify the generated answer.

Architecture

Illustration of the CG-Bench framework, contrasting standard MCQ evaluation with Clue-Grounded evaluation.

Evaluation Highlights

GPT-4o achieves 53.9% accuracy on standard MCQs, but its grounding scores are far lower, so much of that accuracy is not backed by located evidence.
Models show a significant 'credibility gap': accuracy drops from 53.9% (standard MCQ) to 21.7% (acc.@IoU) once strict clue grounding is enforced, a gap of 32.2 points.
Open-source models like Qwen2-VL-72B score 51.4% on MCQs, rivaling GPT-4o, but struggle equally with long-context retrieval and grounding.

Breakthrough Assessment

8/10

Significantly raises the bar for video evaluation by moving beyond simple QA accuracy to evidence-based grounding. The focus on 'credibility' and the rigorous clue-based metrics address a major flaw in current MLLM benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Long-video Question Answering with Temporal Grounding

Inputs: Long video V (>10 mins), Question Q, Candidate Options {A, B, C, ...}

Outputs: Selected Option O and supporting time interval (clue) T_clue
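The setting above can be sketched as a pair of record types. This is only an illustration: the class and field names (`QACSample`, `Prediction`, `clue_intervals`, and so on) are hypothetical and are not taken from the released annotation files.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec), an assumed convention


@dataclass
class QACSample:
    """One question-answer-clue annotation (hypothetical field names)."""
    video_id: str
    duration_sec: float
    question: str
    options: List[str]
    answer_idx: int                 # index of the correct option
    clue_intervals: List[Interval]  # ground-truth evidence spans T_clue

    def __post_init__(self) -> None:
        if not 0 <= self.answer_idx < len(self.options):
            raise ValueError("answer_idx must index into options")
        for start, end in self.clue_intervals:
            if not 0.0 <= start < end <= self.duration_sec:
                raise ValueError(f"clue interval ({start}, {end}) lies outside the video")


@dataclass
class Prediction:
    """Model output: selected option O plus predicted clue spans."""
    option_idx: int
    intervals: List[Interval] = field(default_factory=list)


sample = QACSample(
    video_id="v1",
    duration_sec=900.0,
    question="Why is the character angry?",
    options=["The car is blue", "Her keys were taken"],
    answer_idx=1,
    clue_intervals=[(120.0, 135.5)],
)
pred = Prediction(option_idx=1, intervals=[(118.0, 140.0)])
```

Validating intervals at construction time keeps malformed annotations from silently skewing any grounding metric computed later.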

Pipeline Flow

Video Collection & Filtering
QAC (Question-Answer-Clue) Annotation
Review Iteration
Evaluation (Standard MCQ, White-box, Black-box, Open-ended)

System Modules

QAC Annotation

Human annotators create questions, answers, and mark specific time intervals (clues) that support the answer.

Model or implementation: Human Annotators

White-Box Evaluator (Evaluation)

Assess if the model can explicitly predict the correct time interval.

Model or implementation: Evaluated MLLM
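A minimal sketch of white-box scoring follows, assuming one predicted interval and one ground-truth interval per question. It also assumes that a response only counts toward acc.@IoU when the chosen option is correct and the temporal IoU reaches the threshold; the function names, the `>=` comparison, and the default threshold of 0.5 are illustrative choices, not the benchmark's official scorer.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Overlap length divided by union length of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(
    pred_answers: Sequence[str],
    gold_answers: Sequence[str],
    pred_clues: List[Interval],
    gold_clues: List[Interval],
    threshold: float = 0.5,
) -> float:
    """Fraction of questions answered correctly AND grounded above the threshold."""
    if not gold_answers:
        raise ValueError("need at least one question")
    hits = sum(
        1
        for pa, ga, pc, gc in zip(pred_answers, gold_answers, pred_clues, gold_clues)
        if pa == ga and temporal_iou(pc, gc) >= threshold
    )
    return hits / len(gold_answers)


# Both answers are right, but only the first is grounded in the right span.
score = acc_at_iou(
    ["A", "B"], ["A", "B"],
    [(0.0, 10.0), (0.0, 10.0)],
    [(0.0, 10.0), (50.0, 60.0)],
)
```

In this toy case plain MCQ accuracy would be 1.0, while the grounded score is halved, which is the kind of gap the white-box setting is meant to expose.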

Black-Box Evaluator (Evaluation)

Assess if the model implicitly attends to the correct clues by comparing full-video vs. clue-only performance.

Model or implementation: Evaluated MLLM
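The black-box comparison can be sketched as below. The sketch assumes that the Clue Recovery Rate is the ratio of full-video accuracy to clue-only accuracy, which is one reading of the description in this summary; the paper's exact formula should be checked before relying on it. Function names are hypothetical.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Plain MCQ accuracy over aligned prediction and answer lists."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and aligned")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def clue_recovery_rate(
    long_video_preds: Sequence[str],
    clue_only_preds: Sequence[str],
    golds: Sequence[str],
) -> float:
    """Assumed definition: accuracy on the full video / accuracy on the clue clip."""
    clue_acc = accuracy(clue_only_preds, golds)
    if clue_acc == 0:
        raise ValueError("clue-only accuracy is zero; the ratio is undefined")
    return accuracy(long_video_preds, golds) / clue_acc


golds = ["A", "B", "C", "D"]
clue_only = ["A", "B", "C", "A"]   # 3 of 4 correct when shown only the clue clip
long_video = ["A", "B", "A", "A"]  # 2 of 4 correct when shown the full video
crr = clue_recovery_rate(long_video, clue_only, golds)
```

Under this reading, a value near 1 means the model loses little when the clue is buried in the full video, while a low value points to difficulty retrieving the clue from long context.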

Novel Architectural Elements

Two-tiered credibility evaluation framework (White-box and Black-box) designed specifically to penalize correct answers derived from non-visual shortcuts.
Clue-aided open-ended evaluation heuristic: uses ground-truth clue intervals to provide visual context to a text-only judge (GPT-4o) to verify open-ended answers.
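One way to picture the clue-aided judging step is to sample frames only from the annotated clue intervals and package them with the answers to be compared. The sketch below is purely illustrative: the helper names, the midpoint sampling scheme, the default of four frames per clue, and the prompt wording are all assumptions, and no judge API is called.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def sample_clue_timestamps(clue_intervals: List[Interval], frames_per_clue: int = 4) -> List[float]:
    """Pick evenly spaced timestamps (sub-segment midpoints) inside each clue interval."""
    if frames_per_clue < 1:
        raise ValueError("frames_per_clue must be at least 1")
    timestamps: List[float] = []
    for start, end in clue_intervals:
        step = (end - start) / frames_per_clue
        timestamps.extend(start + step * (i + 0.5) for i in range(frames_per_clue))
    return sorted(timestamps)


def build_judge_request(
    question: str,
    reference_answer: str,
    model_answer: str,
    clue_intervals: List[Interval],
    frames_per_clue: int = 4,
) -> Dict[str, object]:
    """Bundle clue-frame timestamps with a hypothetical grading prompt."""
    prompt = (
        "Using the attached clue frames as evidence, decide whether the model answer "
        "matches the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with 'correct' or 'incorrect'."
    )
    return {
        "frame_timestamps_sec": sample_clue_timestamps(clue_intervals, frames_per_clue),
        "prompt": prompt,
    }


request = build_judge_request(
    "Why is the character angry?",
    "Her keys were taken.",
    "Someone took her keys.",
    [(10.0, 18.0)],
)
```

Restricting the judge's visual context to the clue spans keeps its input short and focused on the evidence the annotators marked.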

Modeling

Base Model: Various MLLMs evaluated (GPT-4o, Gemini-1.5 Pro, Qwen2-VL, etc.)

Comparison to Prior Work

vs. VideoMME: CG-Bench adds clue intervals for every question, enabling grounding evaluation.
vs. MLVU: CG-Bench focuses on credibility and preventing shortcuts via clue grounding.
vs. NExT-GQA: CG-Bench targets long videos (>10 mins) and covers diverse domains beyond just action/egocentric data.

Limitations

White-box evaluation requires models capable of outputting timestamps, which not all MLLMs support natively.
The 'Clue Recovery Rate' assumes that if a model answers correctly on the long video, it should have attended to the clue; however, other hidden redundancies might exist.
Evaluation relies on the quality of GPT-4o as a judge for open-ended questions, which may still have biases.
Locating short clues in very long videos is inherently difficult, leading to low absolute scores on strict metrics.

Reproducibility

Code: https://cg-bench.github.io/leaderboard/

Data and annotations are released at https://cg-bench.github.io/leaderboard/. The paper describes the annotation and filtering process in detail. Specific prompts for the automated judge (GPT-4o) are mentioned in supplementary materials.

📊 Experiments & Results

Evaluation Setup

Evaluation of multiple MLLMs on the CG-Bench dataset.

Benchmarks:

CG-Bench (Long-Video Question Answering & Temporal Grounding) [New]

Metrics:

MCQ Accuracy
acc.@IoU (Accuracy at Intersection over Union threshold)
mIoU (mean Intersection over Union)
CRR (Clue Recovery Rate)

Statistical methodology: Not explicitly reported in the paper
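Since a question may have more than one clue interval, the mIoU metric listed above can be sketched with interval sets rather than single spans. This is a hedged illustration: merging overlapping spans before measuring, and averaging per-question IoU with equal weight, are assumptions about how such a metric is typically computed, not a transcription of the official scorer.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so lengths are not double counted."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> float:
    return sum(end - start for start, end in intervals)


def multi_interval_iou(pred: List[Interval], gold: List[Interval]) -> float:
    """IoU between two sets of time intervals."""
    p, g = merge(pred), merge(gold)
    inter = sum(
        max(0.0, min(pe, ge) - max(ps, gs))
        for ps, pe in p
        for gs, ge in g
    )
    union = total_length(p) + total_length(g) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[List[Interval]], golds: List[List[Interval]]) -> float:
    """Average per-question IoU; an empty prediction scores 0 for that question."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and aligned")
    return sum(multi_interval_iou(p, g) for p, g in zip(preds, golds)) / len(golds)


miou = mean_iou(
    [[(0.0, 10.0)], []],
    [[(0.0, 10.0)], [(5.0, 6.0)]],
)
```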

Key Results

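Standard MCQ accuracy results show commercial models leading, but open-source models are competitive. Here the baseline is Qwen2-VL-72B and the compared result is GPT-4o.

Benchmark	Metric	Baseline	Compared	Δ
CG-Bench (Long-Video MCQ)	Accuracy	51.4	53.9	+2.5

Credibility evaluation (White-box) reveals a sharp drop in performance when grounding is required. Here the baseline is standard MCQ accuracy and the compared result is acc.@IoU under strict grounding.

Benchmark	Metric	Baseline	Compared	Δ
CG-Bench (White-box)	acc.@IoU (strict grounding)	53.9	21.7	-32.2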
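The deltas in the rows above can be rechecked with a few lines of arithmetic. The relative drop is an extra derived figure computed from the reported scores, not a number stated in the summary.

```python
# Scores as reported in the results rows (percentage points).
mcq_acc_reference = 53.9    # standard MCQ accuracy
mcq_acc_open_source = 51.4  # standard MCQ accuracy of the compared open-source model
grounded_acc = 21.7         # acc.@IoU under strict grounding

model_gap = round(mcq_acc_reference - mcq_acc_open_source, 1)
credibility_gap = round(mcq_acc_reference - grounded_acc, 1)
relative_drop_pct = round(100 * (mcq_acc_reference - grounded_acc) / mcq_acc_reference, 1)

print(f"MCQ gap between models: {model_gap} points")
print(f"Credibility gap: {credibility_gap} points")
print(f"Relative drop under grounding: {relative_drop_pct}%")
```

Seen this way, roughly three fifths of the standard MCQ accuracy disappears once the answer must also be grounded in the right interval.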

Experiment Figures

Distribution of video durations and clue interval positions.

Main Takeaways

Current MLLMs significantly underperform on long video understanding when verified by clue grounding, despite decent MCQ scores.
There is a large 'credibility gap': models often select the right answer without knowing where the information is in the video.
Open-source models like Qwen2-VL are closing the gap with commercial models like GPT-4o on standard accuracy, but all models struggle with retrieval in long contexts.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with Video Question Answering (VideoQA) tasks
Basic concepts of Intersection over Union (IoU) for temporal grounding

Key Terms

MLLM: Multimodal Large Language Model, an AI system capable of processing and generating both text and visual data (images/videos).

MCQ: Multiple-Choice Question—a format where the model selects the correct answer from a list of options.

tIoU: Temporal Intersection over Union—a metric measuring the overlap between a predicted time interval and the ground-truth time interval.
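For reference, the standard form of this metric can be written out as follows, with a predicted interval $[s_p, e_p]$ and a ground-truth interval $[s_g, e_g]$; the symbol names are chosen here for illustration.

```latex
\mathrm{tIoU} =
  \frac{\max\bigl(0,\ \min(e_p, e_g) - \max(s_p, s_g)\bigr)}
       {(e_p - s_p) + (e_g - s_g) - \max\bigl(0,\ \min(e_p, e_g) - \max(s_p, s_g)\bigr)}
```

As a worked example, a prediction of $[0, 10]$ seconds against a ground truth of $[5, 15]$ seconds overlaps for 5 seconds and spans a union of 15 seconds, giving $\mathrm{tIoU} = 5/15 \approx 0.33$.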

Clue-Grounded: An evaluation approach where the model must identify the specific video segment (clue) that contains the information needed to answer the question.

White-box evaluation: An evaluation setting where the model is explicitly asked to output the timestamps of the relevant clue along with the answer.

Black-box evaluation: An evaluation setting that infers model reliability by comparing its performance on the full video versus its performance when given only the short clue clip.

CRR: Clue Recovery Rate—a metric measuring how well a model maintains its accuracy when processing the full long video compared to when it sees only the relevant clue clip.

Context Dilution: The phenomenon where a model's ability to retrieve relevant information degrades as the amount of irrelevant input (context length) increases.

Hallucination: In AI, when a model generates plausible-sounding but incorrect or factually baseless information.