Chinese University of Hong Kong MMLab,
Tsinghua University
arXiv.org
(2025)
MM, RL, Reasoning, Benchmark
📝 Paper Summary
Video Reasoning, Multimodal Reinforcement Learning
Video-R1 adapts the DeepSeek-R1 reinforcement learning paradigm to video MLLMs by introducing a contrastive temporal reward that forces models to rely on frame order rather than static shortcuts.
Core Problem
Standard RL methods like GRPO lack explicit signals for temporal reasoning, causing models to exploit 'shortcuts' by answering based on single frames rather than understanding event progression.
Why it matters:
Models often guess answers from static visual cues (e.g., seeing a stove and guessing 'cooking') without processing the actual sequence of events, failing at causal reasoning
Existing video datasets focus on recognition rather than complex reasoning, limiting the effectiveness of reinforcement learning for dynamic tasks
Directly applying text-based RL (like DeepSeek-R1) to video fails to incentivize the specific capability of temporal modeling
Concrete Example: When asked 'What happens after the man opens the fridge?', a standard model might spot 'cooking' in a later frame and answer correctly even when the frames are shuffled. T-GRPO exposes this shortcut by comparing the model's accuracy on ordered and shuffled frames, and it grants the temporal reward only when accuracy on the ordered frames is higher.
Key Novelty
Temporal Group Relative Policy Optimization (T-GRPO)
Modifies the GRPO algorithm to include a contrastive temporal reward: the model generates answers for both ordered and shuffled video frames
Assigns a positive temporal reward only when the model's accuracy on ordered frames strictly exceeds its accuracy on shuffled frames, so answers that rely only on static frame content earn no temporal reward
Combines image-based reasoning data (for general logic) with video data (for temporal logic) to overcome the scarcity of high-quality video reasoning benchmarks
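The ordered-versus-shuffled comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`shuffle_frames`, `group_accuracy`, `relies_on_order`), the use of frame identifiers as strings, and exact-match answer scoring are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def shuffle_frames(frames: Sequence[str], seed: int = 0) -> List[str]:
    """Return a shuffled copy of the frame list; the input is left untouched."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def group_accuracy(answers: Sequence[str], gold: str) -> float:
    """Fraction of sampled answers in one group that match the reference.

    Exact string match (case-insensitive) is an assumption of this sketch.
    """
    if not answers:
        return 0.0
    target = gold.strip().lower()
    return sum(a.strip().lower() == target for a in answers) / len(answers)


def relies_on_order(ordered_answers: Sequence[str],
                    shuffled_answers: Sequence[str],
                    gold: str) -> bool:
    """True when accuracy on ordered frames strictly exceeds accuracy on shuffled frames."""
    p = group_accuracy(ordered_answers, gold)
    p_tilde = group_accuracy(shuffled_answers, gold)
    return p > p_tilde


# Hypothetical groups of sampled answers for one multiple-choice question.
ordered = ["B", "B", "A", "B"]    # accuracy 0.75 on ordered frames
shuffled = ["B", "A", "A", "C"]   # accuracy 0.25 on shuffled frames
print(relies_on_order(ordered, shuffled, gold="B"))  # True
print(relies_on_order(ordered, ordered, gold="B"))   # False: no gap, shortcut suspected
```

In this sketch a model that answers equally well on shuffled frames produces no accuracy gap, which is the situation the contrastive reward is meant to leave unrewarded.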
Architecture
The T-GRPO (Temporal Group Relative Policy Optimization) training process.
Evaluation Highlights
Video-R1-7B achieves 37.1% accuracy on VSI-Bench (Visual-Spatial Intelligence Benchmark), a result the paper explicitly states outperforms the proprietary GPT-4o model
Demonstrates significant improvements across general video benchmarks including VideoMMMU, MVBench, and TempCompass (qualitative claim, exact deltas not in text)
Breakthrough Assessment
8/10
First systematic application of the R1 reasoning paradigm to video. Addresses the critical 'temporal shortcut' flaw in video MLLMs with a clever contrastive RL mechanism (T-GRPO).
⚙️ Technical Details
Problem Definition
Setting: Multimodal Video Question Answering and Reasoning with Reinforcement Learning
Inputs: Video frames V (ordered or shuffled) and text prompt/question q
Outputs: Reasoning chain (Chain-of-Thought) and final answer a
Pipeline Flow
Input Processing (Video Frames + Text)
Video-LLM Inference (Qwen2.5-VL Backbone)
Output Generation (Reasoning Trace + Final Answer)
System Modules
Qwen2.5-VL-7B-Instruct
Process multimodal inputs and generate CoT and answers
Model or implementation: Qwen2.5-VL-7B-Instruct
Modeling
Base Model: Qwen2.5-VL-7B-Instruct
Training Method: Temporal Group Relative Policy Optimization (T-GRPO)
Objective Functions:
Purpose: Encourage temporal reasoning by comparing performance on ordered vs. shuffled frames.
Formally: r_t = alpha * (p - p_tilde) if p > p_tilde else 0, where p is accuracy on ordered frames and p_tilde on shuffled frames.
Purpose: Regulate output length to prevent overthinking.
Formally: r_l = omega if length in [l_min, l_max] else 0.
Purpose: Optimize policy using group relative advantages.
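The objectives listed above can be sketched in a few lines of Python. This is a hedged illustration of the formulas as this summary states them, not the released training code. The parameter values in the usage lines are hypothetical, and the group-relative advantage uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation), with the population standard deviation and the `eps` term being assumptions of this sketch. How the individual rewards are combined with the answer-accuracy reward is not given here, so the sketch leaves that step out.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def temporal_reward(p: float, p_tilde: float, alpha: float) -> float:
    """r_t = alpha * (p - p_tilde) if p > p_tilde else 0."""
    return alpha * (p - p_tilde) if p > p_tilde else 0.0


def length_reward(length: int, l_min: int, l_max: int, omega: float) -> float:
    """r_l = omega if length lies in [l_min, l_max] else 0."""
    return omega if l_min <= length <= l_max else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    eps avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative values only; alpha, omega, l_min and l_max are hypothetical.
r_t = temporal_reward(p=0.75, p_tilde=0.25, alpha=0.4)
r_l = length_reward(length=300, l_min=256, l_max=512, omega=0.2)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(r_t, r_l, advantages)
```

Because advantages are computed relative to the group, a response is only pushed up when it scores above its sibling responses for the same input, which is the variance-reduction idea behind GRPO.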
Code, models, and data are publicly released at https://github.com/tulerfeng/Video-R1. SFT data is generated using Qwen2.5-VL-72B-Instruct. RL training uses Video-R1-260k.
📊 Experiments & Results
Evaluation Setup
Zero-shot evaluation on video reasoning benchmarks
Benchmarks:
VSI-Bench (Video Spatial Reasoning)
VideoMMMU (Multi-discipline Video Reasoning)
MVBench (General Video Understanding)
TempCompass (Temporal Orientation)
Metrics:
Accuracy (%)
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Distribution of the Video-R1-260k dataset used for RL training
Main Takeaways
Video-R1-7B establishes a strong result on VSI-Bench (37.1%), surpassing GPT-4o, validating that RL can unlock spatial-temporal reasoning in smaller models
The T-GRPO algorithm discourages 'shortcut learning' by rewarding the model only when it performs better on ordered video inputs than on shuffled ones
Mixing image reasoning data with video data is effective for cold-starting the reasoning capabilities before temporal RL training
The model exhibits 'aha moments' (self-correction) during video reasoning, similar to behaviors observed in text-only reasoning models like DeepSeek-R1
Training follows the DeepSeek-R1 paradigm: an SFT cold start followed by RL
Key Terms
T-GRPO: Temporal Group Relative Policy Optimization—Proposed RL algorithm that rewards models for performing better on ordered video frames than shuffled ones
GRPO: Group Relative Policy Optimization—An RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance
SFT: Supervised Fine-Tuning—Training on labeled data to initialize the model before RL
CoT: Chain-of-Thought—A step-by-step reasoning path generated by the model before the final answer
VSI-Bench: Visual-Spatial Intelligence Benchmark, a dataset testing spatial and temporal reasoning capabilities in video models
WER: Word Error Rate—Metric for OCR tasks measuring edit distance between predicted and reference text
ROUGE: Recall-Oriented Understudy for Gisting Evaluation—Metric for evaluating text generation quality based on n-gram overlap
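As a worked illustration of the WER entry above, the following Python sketch computes word error rate as the word-level edit distance divided by the number of reference words. Whitespace tokenization and the function name `word_error_rate` are assumptions of this sketch; the scoring code used in the paper may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value means the predicted text is closer to the reference, and a value of 0 means an exact word-for-word match.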