Rethinking Chain-of-Thought Reasoning for Videos

📝 Paper Summary

Video Understanding, Multimodal Large Language Models (MLLMs), Efficient Inference

Long, human-like Chain-of-Thought reasoning is unnecessary for effective video understanding; models trained via direct RL to produce concise reasoning on compressed visual tokens achieve better performance with significantly lower latency.

Core Problem

Chain-of-Thought (CoT) in video MLLMs incurs massive computational overhead due to redundant visual tokens and long text generation, while requiring expensive supervised fine-tuning (SFT) data.

Why it matters:

The combination of high-resolution video inputs (thousands of tokens) and verbose CoT outputs creates memory and latency bottlenecks that hinder deployment
Standard CoT training pipelines rely on two-stage training (SFT + RL) with costly human-annotated reasoning traces, slowing down development cycles
Empirical benchmarks reveal that prompting base models for concise reasoning fails, and token compression degrades performance significantly when applied to unadapted models

Concrete Example: When a standard CoT model (like Video-R1) answers a video question, it generates 'pondering' phrases like 'Hmm, let's think...' and long intermediate steps, taking ~11.9 seconds. The proposed concise model outputs a brief rationale and the answer in ~1.7 seconds, while matching or exceeding accuracy.

Key Novelty

Concise-Reasoning RL Framework with Integrated Token Compression

Replaces the standard two-stage CoT training (SFT + RL) with a single-stage Reinforcement Learning process (using GRPO) that rewards correct answers without requiring annotated reasoning traces
Integrates visual token compression (pruning/merging) directly into the training loop, allowing the model to adapt to sparse visual information rather than just applying compression at inference time
Enforces a 'concise reasoning' decoding style that generates short, dense rationales instead of lengthy, rambling chains of thought
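
A rule-based reward of the kind this recipe relies on can be sketched as below. The summary describes the reward only as accuracy plus format, so the output template (`<think>`/`<answer>` tags), the multiple-choice answer matching, and the equal 1.0 weights are assumptions made for illustration, not the paper's confirmed implementation.

```python
import re

# Assumed output template: reasoning inside <think>, final answer inside <answer>.
# The tag names are hypothetical; the source does not specify the format.
_TEMPLATE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(response) else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer matches the ground truth, else 0.0."""
    match = _TEMPLATE.fullmatch(response)
    if match is None:
        return 0.0
    predicted = match.group(2).strip().upper()
    return 1.0 if predicted == ground_truth.strip().upper() else 0.0


def total_reward(response: str, ground_truth: str) -> float:
    """Sum of accuracy and format rewards (equal weights are an assumption)."""
    return accuracy_reward(response, ground_truth) + format_reward(response)
```

Because the reward checks only the final answer and the template, no annotated reasoning trace is needed; the content of the reasoning span is never scored directly.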

Architecture

The proposed efficient post-training framework compared to traditional CoT pipelines.

Evaluation Highlights

+1.6% accuracy improvement on VideoMME (60.3 vs 58.7) compared to the base Qwen2.5-VL model, while outperforming the lengthy Video-R1 baseline (59.5)
~7x reduction in inference latency compared to standard CoT models (1.71s vs 11.9s) due to concise decoding and token compression
Eliminates the Supervised Fine-Tuning (SFT) stage entirely, reducing training time from >30 hours (for Video-R1) to ~5 hours

Breakthrough Assessment

8/10

Challenges the prevailing orthodoxy that 'more tokens + longer reasoning = better performance' in video LLMs. By showing that concise RL-trained models outperform verbose CoT models, it offers a highly practical path for efficient deployment.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering and Reasoning

Inputs: Video frames x and textual query q

Outputs: Concise text response y containing a brief reasoning trace followed by the final answer

Pipeline Flow

Input Processing: Video → Visual Encoder → Token Compression
Reasoning: Compressed Tokens + Query → LLM (Qwen2.5-VL) → Concise Reasoning Generation
Output: Final Answer

System Modules

Visual Encoder (Input Processing)

Extract visual features from video frames

Model or implementation: Qwen2.5-VL Vision Encoder

Token Compressor (Input Processing)

Reduce the number of visual tokens to lower prefilling cost

Model or implementation: AIM-based Pruning/Merging
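
To make the merging half of token compression concrete, here is a minimal sketch that averages runs of consecutive, highly similar tokens. It is not AIM's actual algorithm: the greedy run-based strategy, the function names, and the 0.9 cosine threshold are all assumptions, and plain Python lists stand in for tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def merge_similar_tokens(tokens, threshold=0.9):
    """Average each run of consecutive tokens that stay similar to the run's mean.

    tokens: list of feature vectors (lists of floats), in sequence order.
    Returns a shorter (or equal-length) list of merged vectors.
    """
    if not tokens:
        return []
    merged = []
    run = [tokens[0]]
    for tok in tokens[1:]:
        mean = [sum(col) / len(run) for col in zip(*run)]
        if cosine(mean, tok) > threshold:
            run.append(tok)  # similar enough: extend the current run
        else:
            merged.append(mean)  # close the run and start a new one
            run = [tok]
    merged.append([sum(col) / len(run) for col in zip(*run)])
    return merged
```

Redundant neighbouring tokens, such as patches from near-identical frames, collapse into one vector, which shortens the sequence the LLM must prefill.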

LLM Backbone

Process multimodal inputs and generate text

Model or implementation: Qwen2.5-VL-7B-Instruct

Novel Architectural Elements

Training-integrated token compression: The token pruning/merging is active during the RL post-training phase, allowing the policy to adapt to information loss
Hybrid Attention Strategy: Disables FlashAttention only in the layers that perform token pruning, keeping it enabled everywhere else to maintain compatibility with hardware accelerators
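
The pruning half can be sketched as a top-k selection over per-token attention scores. This is an illustrative assumption about how attention-guided pruning might work, not the paper's code: the function name, the `keep_ratio` parameter, and the score source are hypothetical. It also suggests one plausible reason for the hybrid attention strategy: fused kernels such as FlashAttention do not return the attention matrix, so a layer that ranks tokens by attention would need a standard attention path.

```python
import math


def prune_by_attention(tokens, attn_scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    tokens: list of visual tokens (any objects).
    attn_scores: one score per token, e.g. attention received from text tokens
                 (how the scores are obtained is an assumption here).
    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(attn_scores):
        raise ValueError("tokens and attn_scores must have the same length")
    if not tokens:
        return [], []
    num_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: attn_scores[i], reverse=True)
    keep = sorted(ranked[:num_keep])  # restore original temporal/spatial order
    return [tokens[i] for i in keep], keep
```

Running this inside the training loop, rather than only at inference, is what lets the policy adapt to the missing tokens.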

Modeling

Base Model: Qwen2.5-VL-7B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Encourage correct answers and proper formatting relative to the group average.

Formally: Maximize expected advantage A_i, where A_i is the reward (accuracy + format) of sample i, normalized by the mean and standard deviation of the rewards within its sampling group G.

Purpose: Prevent the model from deviating too far from the reference policy (pre-trained model).

Formally: Subtract β * D_KL(π_θ || π_ref) from the objective.
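
The two terms above can be sketched numerically. The group normalization A_i = (r_i - mean(r)) / std(r) is the standard GRPO form; the `eps` stabilizer, the function names, and any value of `beta` are assumptions, since the paper's hyperparameters are not reported here. The sketch omits the clipped probability ratio that a full GRPO loss also uses.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample in the group scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalized_objective(advantage_term, kl_divergence, beta):
    """Objective value after subtracting beta * D_KL(pi_theta || pi_ref)."""
    return advantage_term - beta * kl_divergence
```

For a group with rewards `[2.0, 1.0, 0.0, 1.0]`, the best sample gets a positive advantage and the worst a negative one, while a group in which all samples earn the same reward yields zero advantage for every sample and therefore no learning signal.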

Adaptation: Full fine-tuning (RL only, no SFT)

Trainable Parameters: All parameters updated via RL

Key Hyperparameters:

group_size_G: Not explicitly reported in the paper
kl_beta: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Training takes ~5 hours on 4 high-end GPUs (compared to >30 hours for CoT baselines)

Comparison to Prior Work

vs. Video-R1: Skips SFT entirely; trains for concise rather than long reasoning; integrates token compression into training
vs. Qwen2.5-VL (Base): Adds RL post-training and token compression for better efficiency and accuracy
vs. AIM: Adapts the model to compressed tokens via RL training rather than applying compression only at test time

Limitations

The approach relies on the base model already having some reasoning capability; RL aligns existing abilities rather than injecting new knowledge.
Requires rule-based rewards (ground truth answers), making it less applicable to open-ended generation without clear correct answers.
Token compression might still discard fine-grained visual details necessary for extremely subtle visual tasks.

Reproducibility

Code: https://github.com/LaVi-Lab/Rethink_CoT_Video

The paper uses public benchmarks (VideoMME, MLVU) and a public base model (Qwen2.5-VL). Specific hyperparameters (learning rate, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on video understanding benchmarks

Benchmarks:

VideoMME (General Video Understanding)
MLVU (Long Video Understanding)
VideoHolmes (Complex Video Reasoning)
MVBench (General Video Understanding)

Metrics:

Accuracy (%)
Inference Latency (seconds)
Training Time (hours)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison on accuracy, showing the proposed method (Ours) outperforms both the direct-answer baseline and the heavy Chain-of-Thought baseline (Video-R1) on key benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
VideoMME	Accuracy (%)	58.7	60.3	+1.6
MLVU	Accuracy (%)	63.6	65.5	+1.9
VideoHolmes	Accuracy (%)	50.9	53.2	+2.3

Efficiency comparisons, demonstrating significant reductions in training time and inference latency.

Quantity	Unit	Baseline (Video-R1)	This Paper	Δ
Inference Latency	Seconds per query	11.9	1.71	-10.19
Training Cost	GPU Hours	30	5	-25
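
As a quick arithmetic check, the reported deltas and the "~7x" latency claim can be recomputed from the figures quoted in this summary. The numbers are copied from the results above; the variable and function names are illustrative.

```python
# Accuracy pairs (baseline, this paper) as quoted in the results above.
ACCURACY = {
    "VideoMME": (58.7, 60.3),
    "MLVU": (63.6, 65.5),
    "VideoHolmes": (50.9, 53.2),
}

# Inference latency in seconds per query: Video-R1 vs. the concise model.
LATENCY_COT_S = 11.9
LATENCY_CONCISE_S = 1.71


def accuracy_deltas(results):
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return {name: round(ours - base, 1) for name, (base, ours) in results.items()}


def latency_speedup(slow_s, fast_s):
    """How many times faster the concise model is than the CoT baseline."""
    return slow_s / fast_s


if __name__ == "__main__":
    for name, delta in accuracy_deltas(ACCURACY).items():
        print(f"{name}: {delta:+.1f} points")
    print(f"Latency speedup: {latency_speedup(LATENCY_COT_S, LATENCY_CONCISE_S):.2f}x")
```

The speedup works out to roughly 6.96x, consistent with the "~7x" figure, and the absolute latency saving is 11.9 - 1.71 = 10.19 seconds per query.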

Experiment Figures

Bar charts comparing Training Cost (GPU hours) and Inference Runtime (seconds) between Direct Answer, Concise Reason (Ours), and Chain-of-Thought (Video-R1).

Main Takeaways

Concise reasoning with RL is sufficient for video understanding, achieving better accuracy than long CoT on 3 out of 4 benchmarks (VideoMME, MLVU, VideoHolmes).
Token compression applied to pre-trained models without adaptation hurts performance, but incorporating it into RL training recovers accuracy while reducing compute.
The 'pondering' patterns in long CoT (e.g., 'Hmm, let me think') are often redundant and do not contribute to accuracy in general video tasks.
Direct RL training (GRPO) without Supervised Fine-Tuning (SFT) is effective, removing the bottleneck of collecting expensive CoT annotations.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Reinforcement Learning (RL) in LLMs
Transformer Architecture (Attention mechanisms)
Token Compression techniques

Key Terms

CoT: Chain-of-Thought—a prompting or training technique where models generate intermediate reasoning steps before the final answer

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes policies by comparing a group of sampled outputs for the same input, removing the need for a separate critic model

SFT: Supervised Fine-Tuning—training a model on a dataset of labeled input-output pairs (usually the first step before RL)

Prefilling: The initial phase of LLM inference where the model processes the input tokens (image/video/text) to compute Key-Value caches

Decoding: The sequential generation phase of LLM inference where the model produces output tokens one by one

Token Compression: Techniques like pruning (removing) or merging (combining) visual tokens to reduce computational cost

KV Cache: Key-Value Cache—stored intermediate states in the Transformer attention mechanism used to speed up generation

Video-R1: A baseline CoT video model that uses a two-stage SFT + RL pipeline to generate long reasoning traces

AIM: A token compression method that merges similar tokens and prunes uninformative ones