Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

📝 Paper Summary

Video Understanding Multimodal Large Language Models (MLLMs) Reinforcement Learning (RL)

VITAL is an agentic framework that enables MLLMs to reason about long videos by actively sampling frames via a visual toolbox, optimized using a difficulty-aware reinforcement learning algorithm.

Core Problem

Existing MLLMs relying on text-based Chain-of-Thought suffer from insufficient cross-modal interaction and high hallucination rates when reasoning over long videos.

Why it matters:

Long video understanding is computationally expensive using standard context extension methods
Text-only reasoning disconnects the model from visual evidence, leading to error accumulation in multi-step tasks
Current RL post-training methods (like GRPO) suffer from difficulty imbalance when applied to multi-task video domains (e.g., mixing easy QA with hard temporal grounding)

Concrete Example: In a temporal grounding task, a text-based CoT model might hallucinate a timestamp based on a single frame or caption, whereas VITAL uses a 'video clipping' tool to actively resample frames at specific intervals, verifying the event's precise start and end times.

Key Novelty

Video Intelligence via Tool-Augmented Learning (VITAL) with Difficulty-aware GRPO

Introduces a visual toolbox (specifically video clipping) allowing the model to 'think with videos' by iteratively requesting and processing new visual information during the reasoning chain
Proposes Difficulty-aware Group Relative Policy Optimization (DGRPO) which scales rewards based on task and sample difficulty to prevent easy tasks from dominating the learning process
Constructs two large-scale datasets (MTVR-CoT-72k and MTVR-RL-110k) specifically filtered for reasoning difficulty to support the tool-augmented training

Architecture

The overall framework of VITAL, illustrating the multi-round tool-augmented reasoning process.

Evaluation Highlights

+11.4% accuracy improvement on LongVideo-Reason (79.3% vs 67.9%) compared to the previous best open-source model
+7.3% Recall@1 improvement on VidChapters-7M temporal grounding (34.7% vs 27.4%)
DGRPO training increases average performance on difficult benchmarks from 50.3 to 52.1 compared to standard GRPO

Breakthrough Assessment

8/10

Significant performance jumps on long video benchmarks by successfully integrating tool use with RL. The difficulty-aware optimization addresses a common failure mode in multi-task RL.

⚙️ Technical Details

Problem Definition

Setting: Multi-task video reasoning including Video QA, Temporal Grounding, and Grounded QA

Inputs: User question T_0 and Video V_0

Outputs: Multimodal Chain-of-Thought trajectory containing reasoning steps, tool calls, and final answer

Pipeline Flow

Visual Encoder & MLLM (encodes initial video and question)
Generation Loop (Model generates <think> trace)
Tool Decider (Model decides to output <tool_call> or <answer>)
Visual Toolbox (Executes tool if called, e.g., clips video)
Context Update (New visual frames appended to context)
Final Answer (Loop terminates when <answer> is generated)

System Modules

MLLM Backbone

Generates reasoning steps, decides tool calls, and produces final answers

Model or implementation: Qwen2.5-VL-7B

Visual Toolbox

Executes video processing requests from the MLLM

Model or implementation: Rule-based functions (Video Clipping)

Novel Architectural Elements

End-to-end agentic loop where the MLLM pauses generation to ingest new visual tokens from the toolbox
Integration of a 'video clipping' tool specifically for refining temporal localization during the reasoning chain

Modeling

Base Model: Qwen2.5-VL-7B

Training Method: Difficulty-aware Group Relative Policy Optimization (DGRPO)

Objective Functions:

Purpose: Maximize accuracy of the final answer.

Formally: R_acc(tau) (IoU for grounding, Exact Match for QA)
Purpose: Enforce correct XML formatting for thought and tool traces.

Formally: R_format(tau)
Purpose: Encourage tool usage to prevent degeneration to text-only reasoning.

Formally: R_tool(tau) (reward if at least one tool called)
Purpose: Balance rewards across easy/hard tasks.

Formally: R(tau) = R_acc * alpha * beta (scaling factors based on task type and sample pass rates)

Adaptation: Full fine-tuning (implied by 'trained... for one epoch')

Training Data:

MTVR-CoT-72k (SFT): 54k basic reasoning, 18k tool-augmented long video reasoning
MTVR-RL-110k (RL): 94k basic reasoning, 16k tool-augmented long video reasoning
Data filtered using 'PassAll' (too easy) and 'PassNone' (too hard) rollout metrics

Key Hyperparameters:

learning_rate_sft: 1e-5
learning_rate_rl: 1e-6
batch_size_sft: 256
+ 3 more
batch_size_rl: 64
rollouts: 8
weight_decay: 1e-2

Compute: 640 GPU hours (total for 4 stages)

Comparison to Prior Work

vs. LongVA/LongVILA: VITAL uses tool-augmented sampling (video clipping) rather than processing all frames at once, reducing computational load and hallucination
vs. Video-R1: VITAL introduces *multimodal* CoT (seeing new frames during reasoning) vs. text-only CoT
vs. GRPO (Standard): VITAL uses DGRPO to dynamically scale rewards based on task difficulty, preventing easy tasks from dominating the gradient

Limitations

Requires ground truth time ranges to generate tool parameters during training (uses noise addition for augmentation)
Zero-shot tool use (without specific training) with GPT/Gemini did not improve performance, indicating the need for specific fine-tuning
Training is computationally intensive (multi-round rollouts)
Visual tools other than video clipping (e.g., caption, QA) were found ineffective

Reproducibility

Code: https://zhang9302002.github.io/thinkingwithvideos-page/

Code is available at https://zhang9302002.github.io/thinkingwithvideos-page/. The paper describes dataset construction sources (Charades-STA, ActivityNet-MR, etc.) and filtering logic. Pretrained backbone is Qwen2.5-VL-7B.

📊 Experiments & Results

Evaluation Setup

Evaluated on 11 video understanding benchmarks covering VQA, reasoning, and temporal grounding.

Benchmarks:

LongVideo-Reason (LVR) (Long video reasoning)
VidChapters-7M (Long video temporal grounding)
Video-MME (Long video QA)
Charades-STA (Short video temporal grounding)

Metrics:

Accuracy (Acc)
Recall@1, IoU=0.3/0.5/0.7 (R@1, mIoU)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LongVideo-Reason (LVR)	Accuracy	67.9	79.3	+11.4
VidChapters-7M	R@1, IoU=0.5	27.4	34.7	+7.3
Average of 4 benchmarks (Video-MMMU, Charades-STA, LVR, VidChapters)	Average Score	50.3	52.1	+1.8
Average of 4 benchmarks	Average Score	47.7	52.1	+4.4

Experiment Figures

Performance comparison (radar chart) of VITAL against baselines on multiple benchmarks.

Qualitative comparison between Text-based CoT and Multimodal CoT.

Main Takeaways

Tool-augmented multimodal CoT significantly outperforms text-based CoT, especially in long video scenarios where visual evidence is crucial.
Video clipping is the most effective tool; other tools like captioning or QA generators did not yield improvements.
DGRPO is essential for stabilizing multi-task RL training where task difficulties vary significantly (e.g., binary QA vs. continuous IoU grounding).
Temporal grounding and question answering are mutually beneficial when trained jointly.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) prompting

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same input to reduce variance

DGRPO: Difficulty-aware Group Relative Policy Optimization—the authors' proposed variant that scales rewards based on task-specific and sample-specific difficulty weights

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

SFT: Supervised Fine-Tuning—training the model on labeled data before applying reinforcement learning

IoU: Intersection over Union—a metric measuring the overlap between the predicted time range and the ground truth time range in temporal grounding

Temporal Grounding: The task of identifying the specific start and end timestamps of an event described in text within a video

Hallucination: The generation of factually incorrect information or details not present in the source content (video)

Visual Toolbox: A set of external functions (e.g., video clipping) the model can invoke to process visual data during generation