Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

📝 Paper Summary

Video Understanding, Multimodal Large Language Models (MLLM)

Video-of-Thought enhances video reasoning by decomposing complex tasks into a chain of sub-problems, moving from fine-grained pixel-level grounding via scene graphs to high-level cognitive semantic analysis.

Core Problem

Existing video MLLMs struggle with complex videos because of two bottlenecks: they lack fine-grained spatial-temporal perceptual understanding, and they cannot perform deep cognitive-level reasoning.

Why it matters:

Current methods mostly perform shallow perception on simple videos, failing to understand intricate spatiotemporal characteristics
Complex real-world applications require understanding not just pixel movements but also the causal implications and commonsense reasoning behind actions
Standard Chain-of-Thought prompting for language does not account for the specific spatiotemporal grounding needs of video data

Concrete Example: In a video showing a person jumping from a height, a standard model might identify the action 'jumping' yet fail to infer that the action carries a risk of fracture and may require medical attention. Similarly, a model may fail to track the specific target 'red oil truck' before analyzing its interaction with a tanker.

Key Novelty

Video-of-Thought (VoT) Framework with MotionEpic MLLM

Introduces MotionEpic, a video MLLM that incorporates Video Spatial-Temporal Scene Graphs (STSG) to achieve fine-grained pixel-level grounding
Proposes VoT, a reasoning framework that breaks video QA into five sequential steps: target identification, temporal grounding (via STSG), action analysis with commonsense, answer ranking, and answer verification
Uses generated scene graphs as intermediate 'rationales' or evidence to ground the high-level reasoning in low-level video pixels
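
To make the idea of a scene graph used as grounding evidence concrete, here is a minimal sketch of an STSG as a data structure. It assumes a simple per-frame layout (object id mapped to a label and bounding box, plus subject-predicate-object triples). The class names, field names, and the sample truck/tanker values are hypothetical illustrations, not the representation used by MotionEpic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed box format for illustration: (x1, y1, x2, y2) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class FrameGraph:
    """Scene graph for one frame: objects are nodes, relations are edges."""
    # object id -> (label, bounding box)
    objects: Dict[int, Tuple[str, BBox]] = field(default_factory=dict)
    # (subject id, predicate, object id)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


@dataclass
class STSG:
    """Spatial-temporal scene graph: an ordered list of per-frame graphs."""
    frames: List[FrameGraph] = field(default_factory=list)

    def tracklet(self, object_id: int) -> List[Tuple[int, BBox]]:
        """Return (frame index, box) for each frame in which the object appears."""
        return [
            (t, graph.objects[object_id][1])
            for t, graph in enumerate(self.frames)
            if object_id in graph.objects
        ]


# Hypothetical three-frame example: a truck approaches a tanker, then leaves view.
stsg = STSG(frames=[
    FrameGraph(
        objects={0: ("truck", (10, 20, 110, 80)), 1: ("tanker", (200, 30, 320, 90))},
        relations=[(0, "approaches", 1)],
    ),
    FrameGraph(
        objects={0: ("truck", (60, 20, 160, 80)), 1: ("tanker", (200, 30, 320, 90))},
        relations=[(0, "next to", 1)],
    ),
    FrameGraph(objects={1: ("tanker", (200, 30, 320, 90))}),
])

truck_track = stsg.tracklet(0)
tanker_track = stsg.tracklet(1)
```

Here `truck_track` covers only the first two frames, which is the kind of localized evidence a later reasoning step could cite when describing what the target did.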

Architecture

Schematic overview of the MotionEpic architecture.

Breakthrough Assessment

8/10

Proposes a logically sound hierarchy for video reasoning that addresses the grounding-hallucination gap in MLLMs. The integration of explicit scene graph generation within the reasoning chain is a significant methodological advance.

⚙️ Technical Details

Problem Definition

Setting: Complex Video Question Answering involving spatial-temporal reasoning

Inputs: Raw video V, Text prompt/Question Q

Outputs: Textual Answer A with supporting reasoning steps

Pipeline Flow

Target Identification (Text-based)
Temporal Grounding (Video + Prompt → STSG Tracklet)
Action Analysis (STSG + Commonsense → Description)
Answer Reasoning (Ranking Candidates)
Verification
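
The five steps above can be sketched as a chain of calls in which each step's output feeds the next prompt. This is only an illustrative sketch: the `Model` interface, the bracketed prompt tags, and `toy_model` with its canned replies are hypothetical stand-ins, not the prompts or API of MotionEpic.

```python
from typing import Callable, Dict, List

# Hypothetical model interface: takes a prompt string, returns a text reply.
Model = Callable[[str], str]


def video_of_thought(model: Model, video_ref: str, question: str,
                     candidates: List[str]) -> Dict[str, str]:
    """Run a five-step chain; every intermediate result is kept in the trace."""
    trace: Dict[str, str] = {}
    # Step 1: identify the target the question is about.
    trace["target"] = model(f"[identify] video={video_ref} question={question}")
    # Step 2: ground the target's tracklet in the video.
    trace["tracklet"] = model(f"[ground] video={video_ref} target={trace['target']}")
    # Step 3: analyze the action using the tracklet and commonsense.
    trace["analysis"] = model(f"[analyze] tracklet={trace['tracklet']} question={question}")
    # Step 4: score each candidate answer and keep the highest-scoring one.
    scores = {
        c: float(model(f"[rank] analysis={trace['analysis']} candidate={c}"))
        for c in candidates
    }
    trace["answer"] = max(scores, key=scores.get)
    # Step 5: verify the chosen answer against the grounding and the analysis.
    trace["verified"] = model(
        f"[verify] answer={trace['answer']} tracklet={trace['tracklet']} "
        f"analysis={trace['analysis']}"
    )
    return trace


def toy_model(prompt: str) -> str:
    """Canned replies so the chain can run without a real MLLM."""
    if prompt.startswith("[identify]"):
        return "red oil truck"
    if prompt.startswith("[ground]"):
        return "truck visible in frames 0-1, moving toward tanker"
    if prompt.startswith("[analyze]"):
        return "the truck drives toward the tanker; a crash is likely"
    if prompt.startswith("[rank]"):
        candidate = prompt.split("candidate=")[-1]
        return "9" if "collide" in candidate else "2"
    if prompt.startswith("[verify]"):
        return "yes"
    return ""


result = video_of_thought(
    toy_model,
    video_ref="clip_001",
    question="What will happen to the red oil truck?",
    candidates=["It will collide with the tanker", "It will park safely"],
)
```

Keeping every intermediate output in `trace` reflects the design point that the grounding result serves as inspectable evidence for the final answer.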

System Modules

Target Identifier (Reasoning Framework (VoT))

Identify potential targets involved in the user's question to focus observation

Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)

Temporal Grounder (Reasoning Framework (VoT))

Ground the spatial-temporal tracklets of the identified targets within the video

Model or implementation: MotionEpic (STSG Generator)

Action Analyzer (Reasoning Framework (VoT))

Interpret object trajectories and interactions with neighbors using commonsense knowledge

Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)

Answer Ranker (Reasoning Framework (VoT))

Score the likelihood of each candidate answer based on the derived insights

Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)

Verifier (Reasoning Framework (VoT))

Verify the final answer against pixel grounding and commonsense cognition

Model or implementation: MotionEpic (Vicuna-7B v1.5 backbone)

Novel Architectural Elements

Integration of a recurrent Graph Transformer to encode multi-frame STSG information directly into the MLLM embedding space
A 5-step explicit reasoning chain (VoT) that mandates STSG generation as an intermediate 'grounding' step before semantic inference

Modeling

Base Model: Vicuna-7B (v1.5)

Training Method: Instruction Tuning with LoRA

Objective Functions:

Global matching (L1): predict whether the overall input video and STSG are paired
Full graph generation (L2): generate the whole STSG expression given a video
Action grounding (L3): output object tracklets given a video and an action description
Key object description (L4): describe temporal actions and output tracklets given a video and key objects
Object recognition (L5): output the object label and tracklet given a bounding box

Adaptation: LoRA (Low-Rank Adaptation) on the LLM backbone

Trainable Parameters: STSG encoder, Video Projector (Q-Former), LoRA parameters (Video encoder and LLM backbone frozen)
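
As a reminder of what LoRA adaptation does to a single linear layer, the sketch below applies a frozen weight W plus a scaled low-rank update B·A. The layer sizes, rank, and scaling value are illustrative assumptions, since the LoRA settings used for MotionEpic are not reported here. Pure Python lists are used so the sketch runs without extra dependencies.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


random.seed(0)
d_in, d_out, rank, alpha = 16, 16, 2, 4  # illustrative sizes, not the paper's

# Frozen pre-trained weight: never updated during adaptation.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors. B starts at zero, so the update starts at zero.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]


def lora_forward(x):
    """Compute W x + (alpha / rank) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


x = [random.gauss(0, 1) for _ in range(d_in)]
y_init = lora_forward(x)

frozen_params = d_out * d_in              # 256 in this toy layer
trainable_params = rank * (d_in + d_out)  # 64 in this toy layer
```

Because B is initialized to zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small factors need gradients during tuning.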

Training Data:

Pre-training: Webvid dataset
Grounding-aware tuning: Video-STSG pairs
Instruction tuning: VideoChat and Video-ChatGPT datasets

Compute: Not reported in the paper

Comparison to Prior Work

vs. Video-LLaMA/Video-ChatGPT: MotionEpic explicitly integrates Spatial-Temporal Scene Graphs (STSG) for fine-grained grounding, whereas others rely on holistic video features [cited in paper]
vs. Standard CoT (Zero-shot): VoT enforces a specific low-to-high level reasoning structure (Pixel -> Object -> Action -> Semantics) rather than a generic 'think step by step' prompt [cited in paper]

Limitations

Relies on the availability or generation quality of STSG data during the training phase
Inference speed may be impacted by the multi-step reasoning process (5 distinct prompt/response steps per question)
Performance depends heavily on the underlying object detection and tracking implicit in the STSG generation capability

Reproducibility

Code: https://haofei.vip/VoT

The project page is at https://haofei.vip/VoT. The paper uses standard backbones (Vicuna-7B, ViT-L/14) and datasets (Webvid, VideoChat). Specific hyperparameters for the Graph Transformer and the LoRA settings are not detailed.

📊 Experiments & Results

Evaluation Setup

Video Question Answering on complex benchmarks

Benchmarks:

8 complex video question answering (QA) benchmarks

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper
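
For reference, accuracy on a QA benchmark is the fraction of questions answered correctly. The sketch below assumes exact-match scoring of a chosen option against the gold option. The exact scoring protocol for each benchmark is not stated here, and the sample labels are made up for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up multiple-choice labels, for illustration only.
score = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
```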

Experiment Figures

The Video-of-Thought (VoT) reasoning framework flowchart.

Main Takeaways

The paper reports substantial gains over prior state-of-the-art results on 8 complex video QA benchmarks, in both fine-tuning and zero-shot settings.
The framework reaches these new state-of-the-art results by decomposing reasoning into perception (grounding) and cognition (commonsense).
Qualitative analysis suggests that fine-grained spatial-temporal grounding (via STSG) is a critical prerequisite for accurate high-level video reasoning.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLM)
Chain-of-Thought (CoT) Prompting
Vision Transformers (ViT)
Scene Graph Generation

Key Terms

STSG: Spatial-Temporal Scene Graph—a structured representation of video content consisting of objects (nodes) and their relationships (edges) across time frames

MotionEpic: The novel video MLLM proposed in this paper, capable of encoding and generating STSGs for fine-grained grounding

VoT: Video-of-Thought—the proposed reasoning framework that decomposes video tasks into perception, grounding, and cognition steps

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices

Q-Former: Querying Transformer—a module used to bridge the gap between frozen image encoders and frozen LLMs by extracting visual features

Vicuna: An open-source chatbot trained by fine-tuning LLaMA on user-shared conversations