Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought

📝 Paper Summary

Video understanding
Multimodal Large Language Models (MLLMs)

Video-CoT is a large-scale dataset of videos annotated with fine-grained spatiotemporal reasoning chains to train models that can better locate objects and events in both time and space.

Core Problem

Current Vision-Language Models (VLMs) struggle with fine-grained spatiotemporal reasoning because existing datasets focus on simple summarization or lack integrated spatial (where) and temporal (when) annotations.

Why it matters:

Accurate video comprehension is essential for robotics and interactive systems, which need to know exactly when an event happens and where objects are located
Existing datasets such as Jester and FineGym cover only the spatial or only the temporal dimension, preventing models from learning the interplay between object positions and event timing
Current Chain-of-Thought (CoT) video data often overlooks precise start/end times and pixel coordinates, limiting its utility for grounded reasoning tasks

Concrete Example: When asked 'When does the black SUV leave the adult in pink?', a standard model might just say 'a car drives away'. A model trained on Video-CoT can output the exact start/end timestamps and bounding boxes for the SUV and the adult throughout the event sequence.

Key Novelty

Video-CoT Dataset and CoT-SFT Strategy

Constructs a dataset with 192,000 question-answer pairs and 23,000 Chain-of-Thought (CoT) samples that explicitly detail spatiotemporal reasoning steps (e.g., identifying objects -> tracking movement -> determining time)
Proposes a Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) method that trains models to output intermediate reasoning steps (like 'First, locate the ball...') before the final answer, improving accuracy on complex queries

Architecture

The pipeline for generating the Chain-of-Thought data. It shows how a large VLM (Qwen2.5-VL-72B) takes video and prompts to generate reasoning chains, which are then filtered for quality.

Evaluation Highlights

+14.3 tIoU (temporal Intersection over Union) on the Temporal Video Localization task with CoT-SFT compared to standard fine-tuning (Video-Ans-SFT): 5.4 -> 19.7
Video-CoT-SFT model (3B parameters) achieves 19.7 tIoU on localization, outperforming larger open-source 7B models like LLaVA-Video-7B (4.3 tIoU)
+1.9 on the Temporal Video Reference task (4.9 -> 6.8) with CoT-SFT compared to the baseline Qwen2.5-VL model

Breakthrough Assessment

8/10

Provides a much-needed resource for fine-grained video reasoning. The jump in temporal localization (from 5.4 to 19.7 tIoU) supports the importance of spatiotemporal CoT data.

⚙️ Technical Details

Problem Definition

Setting: Spatiotemporal video understanding encompassing six subtasks: Temporal Video Localization (TVL), Video Captioning (VC), Spatial Video Grounding (SVG), Spatio-Temporal Video Grounding (STVG), Spatial Relationship Reference (SRR), and Temporal Video Reference (TVR).

Inputs: Video V with T frames, natural language question Q

Outputs: Target answer A (which may include timestamps, bounding boxes, or descriptive text) and optionally a reasoning chain R
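
One way to picture a single sample under this definition is a small record type. The sketch below is illustrative only: the class name `VideoQASample`, the field names, the coordinate conventions, and the example values are all assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration; the released dataset may differ.
TimeSpan = Tuple[float, float]             # (start_sec, end_sec)
Box = Tuple[float, float, float, float]    # (x1, y1, x2, y2) in pixels


@dataclass
class VideoQASample:
    video_path: str                        # video V
    question: str                          # question Q
    answer_text: str                       # descriptive part of answer A
    time_span: Optional[TimeSpan] = None   # temporal part of A, if any
    boxes: List[Box] = field(default_factory=list)  # spatial part of A, if any
    reasoning: Optional[str] = None        # optional reasoning chain R

    def has_cot(self) -> bool:
        """Return True when the sample carries a reasoning chain."""
        return bool(self.reasoning)


sample = VideoQASample(
    video_path="example.mp4",              # placeholder path
    question="When does the black SUV leave the adult in pink?",
    answer_text="The SUV drives away from the adult.",
    time_span=(3.0, 7.5),                  # made-up timestamps
)
```

Keeping the timestamp, boxes, and reasoning chain optional mirrors the statement that an answer "may include" these parts depending on the subtask.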

Pipeline Flow

Video Encoder
LLM Backbone (Qwen2.5-VL)
Output Generation (Reasoning Chain -> Final Answer)

System Modules

Video Encoder

Process input video frames into visual embeddings

Model or implementation: Qwen2.5-VL internal vision encoder

LLM Backbone

Process visual features and text query to generate reasoning and answers

Model or implementation: Qwen2.5-VL-3B-Instruct

Novel Architectural Elements

Integration of structured reasoning tokens (<think> tags) directly into the supervised fine-tuning target for video tasks
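
As a rough illustration of such a target, the sketch below wraps a reasoning chain in `<think>` tags ahead of the answer and splits a generated string back into its two parts. The helper names `build_target` and `split_output`, the newline separator, and the example text are assumptions; the summary only states that `<think>` tags appear in the fine-tuning target.

```python
import re
from typing import Optional, Tuple

# Matches the first <think>...</think> span, including across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_target(reasoning: str, answer: str) -> str:
    """Format a supervised target: reasoning inside <think> tags, then the answer."""
    return f"<think>{reasoning.strip()}</think>\n{answer.strip()}"


def split_output(text: str) -> Tuple[Optional[str], str]:
    """Split generated text into (reasoning, answer); reasoning is None if no tags."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()


target = build_target("First, locate the ball.", "2.0s to 4.5s")
reasoning, answer = split_output(target)
```

Splitting the output this way lets an evaluation script score only the final answer while still logging the reasoning chain.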

Modeling

Base Model: Qwen2.5-VL-3B-Instruct

Training Method: Chain-of-Thought Supervised Fine-Tuning (CoT-SFT)

Objective Functions:

Purpose: Maximize likelihood of both reasoning chain and final answer.

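Formally: L_CoT(θ) = - [ λ * Σ_i log P(r_i | V, Q, r_<i) + Σ_j log P(a_j | V, Q, R, a_<j) ]

Here r_i are the tokens of the reasoning chain R, a_j are the tokens of the answer A, and λ weights the reasoning term relative to the answer term.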
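
To make the weighting concrete, here is a minimal sketch that evaluates this objective from per-token log-probabilities. The function name `cot_sft_loss`, the default `lam=1.0`, and the toy numbers are assumptions; in practice the log-probabilities would come from the model, and the value of λ used in the paper is not given in this summary.

```python
from typing import Sequence


def cot_sft_loss(reasoning_logprobs: Sequence[float],
                 answer_logprobs: Sequence[float],
                 lam: float = 1.0) -> float:
    """Negative log-likelihood with the reasoning-chain term scaled by lam.

    reasoning_logprobs[i] stands for log P(r_i | V, Q, r_<i)
    answer_logprobs[j]    stands for log P(a_j | V, Q, R, a_<j)
    """
    return -(lam * sum(reasoning_logprobs) + sum(answer_logprobs))


# Toy values: two reasoning tokens, one answer token.
loss = cot_sft_loss([-0.5, -0.25], [-1.0], lam=0.5)
# -(0.5 * (-0.75) + (-1.0)) = 1.375
```

Setting `lam=0.0` drops the reasoning term, so only the answer tokens contribute to the loss.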

Adaptation: Full fine-tuning (implied by lack of LoRA details and 'fine-tuning' terminology)

Training Data:

192,000 spatiotemporal QA pairs
23,000 CoT-annotated samples (generated by Qwen2.5-VL-72B-Instruct and filtered)

Key Hyperparameters:

learning_rate: 1e-6
epochs: 1
optimizer: AdamW

Compute: Single GPU for fine-tuning (mixed precision)

Comparison to Prior Work

vs. LLaVA-Video: Video-CoT explicitly trains on intermediate reasoning steps for spatial/temporal coordinates, whereas LLaVA-Video typically trains on direct QA pairs
vs. Gemini-1.5-pro: Video-CoT (3B) outperforms the closed-source Gemini-1.5-pro on specific localization metrics through specialized CoT fine-tuning, despite being much smaller

Limitations

The dataset relies on automated generation by a larger VLM (Qwen2.5-VL-72B), which may introduce hallucinations despite filtering
Performance on Spatio-Temporal Video Grounding (STVG) remains relatively low even with improvements, indicating the high difficulty of the task
Experiments are conducted primarily on a 3B parameter model; scaling effects on larger models are not fully explored in the main results

Reproducibility

Code: https://video-cot.github.io/

The Video-CoT dataset and benchmark are publicly available. Availability of evaluation code/scripts is implied via the project website. The base model Qwen2.5-VL is open source.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on the Video-CoT-Benchmark using fine-tuned models.

Benchmarks:

Video-CoT-Benchmark: spatiotemporal video understanding, 6 subtasks [New]

Metrics:

tIoU (Temporal Intersection over Union)
sIoU (Spatial Intersection over Union)
MENTOR (Caption quality metric)
Exact Match (EM)

Statistical methodology: Not explicitly reported in the paper
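
For readers unfamiliar with the two IoU metrics listed above, the following sketch computes them for a single prediction. It uses the standard intersection-over-union definition; the function names, the `(start, end)` and `(x1, y1, x2, y2)` conventions, and the example values are assumptions, and the benchmark's own evaluation script (averaging, score scaling) may differ.

```python
from typing import Tuple

Segment = Tuple[float, float]              # (start_sec, end_sec)
Box = Tuple[float, float, float, float]    # (x1, y1, x2, y2)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Overlap of two time segments divided by the length of their union."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def spatial_iou(pred: Box, gt: Box) -> float:
    """Overlap area of two boxes divided by the area of their union."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


t = temporal_iou((0.0, 4.0), (2.0, 6.0))                 # 2 / 6
s = spatial_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))  # 1 / 7
```

Both functions return 0.0 for disjoint inputs and 1.0 for identical ones, which is the usual range for IoU before any rescaling.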

Key Results

Comparison of fine-tuning methods on the Qwen2.5-VL-3B baseline shows that CoT-SFT significantly boosts temporal localization performance compared to standard Answer-SFT.

Benchmark	Metric	Baseline	This Paper	Δ
Video-CoT-Benchmark (TVL)	tIoU	5.4	19.7	+14.3
Video-CoT-Benchmark (TVR)	MENTOR	4.9	6.8	+1.9
Video-CoT-Benchmark (STVG)	tIoU	44.7	46.2	+1.5
Video-CoT-Benchmark (TVL)	tIoU	8.1	19.7	+11.6
Video-CoT-Benchmark (TVL)	tIoU	4.3	19.7	+15.4

Experiment Figures

Distribution of the dataset across the six tasks (TVL, VC, SVG, STVG, SRR, TVR).

Distribution of video lengths in the dataset.

Main Takeaways

Chain-of-Thought (CoT) fine-tuning raises temporal localization (TVL) from 5.4 to 19.7 tIoU (+14.3) compared to standard fine-tuning.
The proposed Video-CoT-SFT method allows a smaller 3B model to outperform significantly larger models (like Gemini-1.5-pro and LLaVA-Video-7B) on specific fine-grained spatiotemporal tasks.
While CoT helps significantly with temporal tasks, improvements in spatial grounding (sIoU) are more modest, suggesting spatial reasoning might require different or additional optimization strategies.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) architecture
Supervised Fine-Tuning (SFT) techniques
Intersection over Union (IoU) metrics for detection

Key Terms

Chain-of-Thought (CoT): A prompting or training method where the model generates intermediate reasoning steps before the final answer to improve performance on complex tasks

SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, task-specific labeled dataset

tIoU: Temporal Intersection over Union—a metric measuring the overlap between the predicted time segment and the ground truth time segment

sIoU: Spatial Intersection over Union—a metric measuring the overlap between the predicted bounding box and the ground truth bounding box

TVL: Temporal Video Localization—finding the start and end times of an event in a video

VC: Video Captioning—generating a natural language description of the video

SVG: Spatial Video Grounding—locating an object in a specific video frame using a bounding box

STVG: Spatio-Temporal Video Grounding—tracking an object across multiple frames in both space (bounding box) and time

SRR: Spatial Relationship Reference—identifying spatial relationships between objects (e.g., 'A is behind B')

TVR: Temporal Video Reference—describing events that happen within a specific time interval

MENTOR: A metric used in this paper to evaluate the quality and relevance of generated captions and textual descriptions

Curriculum Learning: A training strategy where the model is exposed to easier examples (shorter reasoning chains) before harder ones (longer, complex chains)

Qwen2.5-VL: A specific family of large vision-language models used as the base model in this paper