← Back to Paper List

Cg-bench: Clue-grounded question answering benchmark for long video understanding

G Chen, Y Liu, Y Huang, Y He, B Pei, J Xu, Y Wang…
Nanjing University, Shanghai Artificial Intelligence Laboratory, Fudan University, Zhejiang University
arXiv, 12/2024 (2024)
MM QA Factuality Reasoning Benchmark

📝 Paper Summary

Long Video Understanding Video Question Answering (VideoQA) Multimodal Large Language Models (MLLMs)
CG-Bench is a long-video benchmark that evaluates whether MLLMs genuinely understand content by requiring them to identify specific video clues justifying their answers, preventing reliance on text-only shortcuts.
Core Problem
Existing long-video benchmarks rely on multiple-choice questions (MCQs) where models can often guess correct answers using text-based elimination or general knowledge without genuinely retrieving relevant visual evidence.
Why it matters:
  • Current models achieve high scores on MCQs via elimination strategies, creating a false sense of capability while lacking true video comprehension.
  • Long video understanding requires retrieving specific moments (clues) from hours of content, a capability not tested by standard QA accuracy metrics.
  • Trustworthy AI requires models to ground their reasoning in actual data evidence rather than hallucinating or guessing based on language biases.
Concrete Example: In a video question about why a character is angry, a model might eliminate option A ('The car is blue') because it contradicts the question text, and select option B without ever seeing the relevant scene. CG-Bench exposes this by asking the model to pinpoint the exact timestamps (clues) that support option B.
Key Novelty
Clue-Grounded Evaluation for Long Videos
  • Annotates 'clue intervals' (timestamped evidence) for every QA pair, allowing the benchmark to check if the model found the right part of the video.
  • Introduces 'White-box' evaluation (model must output timestamps) and 'Black-box' evaluation (compares accuracy on full video vs. short clue clip) to measure credibility.
  • Uses a heuristic 'clue-aided' open-ended evaluation where a judge model (GPT-4o) uses the ground-truth video clues to verify the generated answer.
Architecture
Architecture Figure Figure 1
Illustration of the CG-Bench framework, contrasting standard MCQ evaluation with Clue-Grounded evaluation.
Evaluation Highlights
  • GPT-4o achieves 53.9% accuracy on standard MCQs but this performance is not fully supported by grounding capability.
  • Models show a significant 'credibility gap': accuracy drops from ~53% (standard MCQ) to ~21% when enforcing strict clue-grounding requirements.
  • Open-source models like Qwen2-VL-72B score 51.4% on MCQs, rivaling GPT-4o, but struggle equally with long-context retrieval and grounding.
Breakthrough Assessment
8/10
Significantly raises the bar for video evaluation by moving beyond simple QA accuracy to evidence-based grounding. The focus on 'credibility' and the rigorous clue-based metrics addresses a major flaw in current MLLM benchmarks.
×