Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong Li Lee, W. Hsu
International Conference on Machine Learning
(2024)
MM, Reasoning, QA, KG, Benchmark
📝 Paper Summary
Video Understanding, Multimodal Large Language Models (MLLM)
Video-of-Thought enhances video reasoning by decomposing complex tasks into a chain of sub-problems, moving from fine-grained pixel-level grounding via scene graphs to high-level cognitive semantic analysis.
Core Problem
Existing video MLLMs struggle with complex videos due to two bottlenecks: a lack of fine-grained spatial-temporal perceptive understanding and an inability to perform deep cognitive-level reasoning.
Why it matters:
Current methods mostly perform shallow perception on simple videos, failing to understand intricate spatiotemporal characteristics
Complex real-world applications require understanding not just pixel movements but also the causal implications and commonsense reasoning behind actions
Standard Chain-of-Thought prompting for language does not account for the specific spatiotemporal grounding needs of video data
Concrete Example: In a video showing a person jumping from a height, a standard model might identify the action 'jumping' but fail to reason that it implies a risk of fracture and may call for medical attention. Similarly, it might fail to track a specific target such as a 'red oil truck' before analyzing its interaction with a tanker.
Key Novelty
Video-of-Thought (VoT) Framework with MotionEpic MLLM
Introduces MotionEpic, a video MLLM that incorporates Video Spatial-Temporal Scene Graphs (STSG) to achieve fine-grained pixel-level grounding
Proposes VoT, a reasoning framework that breaks video QA into sequential steps: target identification, temporal grounding (via STSG), action analysis with commonsense, and answer verification
Uses generated scene graphs as intermediate 'rationales' or evidence to ground the high-level reasoning in low-level video pixels
Architecture
Schematic overview of the MotionEpic architecture.
Breakthrough Assessment
8/10
Proposes a logically sound hierarchy for video reasoning that addresses the grounding-hallucination gap in MLLMs. The integration of explicit scene graph generation within the reasoning chain is a significant methodological advance.
⚙️ Technical Details
Problem Definition
Setting: Complex Video Question Answering involving spatial-temporal reasoning
Inputs: Raw video V, Text prompt/Question Q
Outputs: Textual Answer A with supporting reasoning steps
Target Identification (Reasoning Framework (VoT))
Identify potential targets involved in the user's question to focus observation
Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)
Temporal Grounder (Reasoning Framework (VoT))
Ground the spatial-temporal tracklets of the identified targets within the video
Model or implementation: MotionEpic (STSG Generator)
Action Analyzer (Reasoning Framework (VoT))
Interpret object trajectories and interactions with neighbors using commonsense knowledge
Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)
Answer Ranker (Reasoning Framework (VoT))
Score the likelihood of each candidate answer based on the derived insights
Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)
Verifier (Reasoning Framework (VoT))
Verify the final answer against pixel grounding and commonsense cognition
Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)
Novel Architectural Elements
Integration of a recurrent Graph Transformer to encode multi-frame STSG information directly into the MLLM embedding space
A 5-step explicit reasoning chain (VoT) that mandates STSG generation as an intermediate 'grounding' step before semantic inference
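To make the STSG idea concrete, here is a small sketch of one possible in-memory representation: a sequence of per-frame graphs whose nodes are objects with bounding boxes and whose edges are relation triples. The class names, field names, and the (x1, y1, x2, y2) box convention are assumptions for illustration; the summary does not specify the paper's actual STSG serialization.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed box convention: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


@dataclass
class FrameGraph:
    """Scene graph for one frame: objects are nodes, relations are edges."""
    boxes: Dict[str, Box] = field(default_factory=dict)  # object id -> box
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)


@dataclass
class STSG:
    """A spatial-temporal scene graph as an ordered list of frame graphs."""
    frames: List[FrameGraph] = field(default_factory=list)

    def tracklet(self, obj_id: str) -> List[Tuple[int, Box]]:
        """Collect (frame index, box) pairs for one object across time."""
        return [
            (t, g.boxes[obj_id])
            for t, g in enumerate(self.frames)
            if obj_id in g.boxes
        ]


# Toy two-frame example using the truck and tanker from the summary.
graph = STSG(frames=[
    FrameGraph(
        boxes={"truck": (0, 0, 10, 10), "tanker": (20, 0, 30, 10)},
        relations=[("truck", "behind", "tanker")],
    ),
    FrameGraph(boxes={"truck": (5, 0, 15, 10)}),
])
truck_track = graph.tracklet("truck")
```

A tracklet in this sketch is simply the time-ordered list of boxes for one object, which is the kind of evidence the temporal grounding step is described as producing.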
Modeling
Base Model: Vicuna-7B (v1.5)
Training Method: Instruction Tuning with LoRA
Objective Functions:
Purpose: Global matching.
Formally: Predict if overall input video and STSG are paired (L1)
Purpose: Full graph generation.
Formally: Generate the whole STSG expression given a video (L2)
Purpose: Action grounding.
Formally: Output object tracklets given video and action description (L3)
Purpose: Key object description.
Formally: Describe temporal actions and output tracklets given video and key objects (L4)
Purpose: Object recognition.
Formally: Output object label and tracklet given a bounding box (L5)
Adaptation: LoRA (Low-Rank Adaptation) on the LLM backbone
Trainable Parameters: STSG encoder, Video Projector (Q-Former), LoRA parameters (Video encoder and LLM backbone frozen)
Training Data:
Pre-training: WebVid dataset
Grounding-aware tuning: Video-STSG pairs
Instruction tuning: VideoChat and Video-ChatGPT datasets
Compute: Not reported in the paper
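For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update in plain Python: the frozen weight W0 is left untouched and a trainable product B·A, scaled by alpha / r, is added to its output. The rank, the alpha value, and which layers receive adapters are not given in this summary, so the numbers here are illustrative only.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W0 x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(a)                        # rank: A has shape (r, d_in)
    base = matvec(w0, x)              # frozen path
    update = matvec(b, matvec(a, x))  # low-rank path: B has shape (d_out, r)
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]


# Illustrative values: identity base weight, rank-1 adapter.
w0 = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
x = [2.0, 3.0]

# With B initialised to zero, the adapted layer matches the frozen layer.
y_init = lora_forward(w0, a, [[0.0], [0.0]], x, alpha=2.0)

# After B changes, only the low-rank path shifts the output.
y_tuned = lora_forward(w0, a, [[1.0], [0.0]], x, alpha=2.0)
```

Initialising B to zero is the usual LoRA convention, so training starts from the behaviour of the frozen backbone.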
Comparison to Prior Work
vs. Video-LLaMA/Video-ChatGPT: MotionEpic explicitly integrates Spatial-Temporal Scene Graphs (STSG) for fine-grained grounding, whereas others rely on holistic video features [cited in paper]
vs. Standard CoT (Zero-shot): VoT enforces a specific low-to-high level reasoning structure (Pixel -> Object -> Action -> Semantics) rather than a generic 'think step by step' prompt [cited in paper]
Limitations
Relies on the availability or generation quality of STSG data during the training phase
Inference speed may be impacted by the multi-step reasoning process (5 distinct prompt/response steps per question)
Performance depends heavily on the underlying object detection and tracking implicit in the STSG generation capability
Project page: https://haofei.vip/VoT. The paper uses standard backbones (Vicuna-7B, ViT-L/14) and datasets (WebVid, VideoChat). Specific hyperparameters for the Graph Transformer and LoRA are not detailed in the provided text.
📊 Experiments & Results
Evaluation Setup
Video Question Answering on complex benchmarks
Benchmarks:
8 complex video QA benchmarks (Video Question Answering)
Metrics:
Accuracy
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
The Video-of-Thought (VoT) reasoning framework flowchart.
Main Takeaways
The paper reports new state-of-the-art results across 8 complex video QA benchmarks in both fine-tuning and zero-shot settings; per-benchmark figures are not included in this summary.
The authors attribute the gains to decomposing reasoning into perception (grounding) and cognition (commonsense).
Qualitative analysis suggests that fine-grained spatial-temporal grounding (via STSG) is a critical prerequisite for accurate high-level video reasoning.
📚 Prerequisite Knowledge
Prerequisites
Multimodal Large Language Models (MLLM)
Chain-of-Thought (CoT) Prompting
Vision Transformers (ViT)
Scene Graph Generation
Key Terms
STSG: Spatial-Temporal Scene Graph—a structured representation of video content consisting of objects (nodes) and their relationships (edges) across time frames
MotionEpic: The novel video MLLM proposed in this paper, capable of encoding and generating STSGs for fine-grained grounding
VoT: Video-of-Thought—the proposed reasoning framework that decomposes video tasks into perception, grounding, and cognition steps
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices
Q-Former: Querying Transformer—a module used to bridge the gap between frozen image encoders and frozen LLMs by extracting visual features
Vicuna: An open-source chatbot trained by fine-tuning LLaMA on user-shared conversations