Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

📝 Paper Summary

Post-training methodologies for Video-LMMs
Video reasoning alignment

This survey systematizes post-training for Video-LMMs by integrating Chain-of-Thought SFT, verifiable Reinforcement Learning (GRPO), and Test-Time Scaling into a unified framework for advanced video reasoning.

Core Problem

Video understanding models struggle to transition from basic perception to sophisticated reasoning due to challenges in temporal localization, spatiotemporal grounding, and the lack of a unified post-training methodology.

Why it matters:

Models need to reason about complex event causality and long-term dependencies, which standard pre-training does not sufficiently address
Current literature is fragmented between SFT, RL, and inference scaling, lacking a cohesive roadmap for building reasoning engines
Native visual domains lack the internet-scale self-supervised learning efficiency found in language modeling, requiring specialized post-training to bridge the gap

Concrete Example: A standard Video-LMM might correctly identify objects in a video but fail when asked 'Why did the person fall?' because it cannot ground the causal event in specific temporal segments. An RL-aligned model with temporal rewards would correctly locate the tripping event and explain the cause.

Key Novelty

Unified Video-LMM Post-Training Taxonomy

Organizes post-training into three pillars: SFT for reasoning formats, RL (specifically GRPO) for verifiable optimization without human preference data, and Test-Time Scaling for reliable inference
Defines video-specific adaptations for RL, such as using temporal IoU and spatial grounding consistency as verifiable rewards instead of learned reward models
Integrates 'thinking' processes into video models, enabling them to perform staged viewing, self-correction, and multi-path reasoning during inference
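To make the temporal IoU reward concrete, here is a minimal sketch of how such a verifiable reward could be computed for one predicted segment. The function name and the (start, end) tuple convention in seconds are assumptions for illustration, not taken from the survey.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    Hypothetical helper: the (start, end) tuple layout is an assumption.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Degenerate or inverted segments earn no reward.
    if pred_end <= pred_start or gt_end <= gt_start:
        return 0.0
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Predicted 2-6 s against ground truth 4-8 s: 2 s of overlap over a 6 s union.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

Because the score is computed directly from the annotation, no learned reward model is involved, which is what makes the reward verifiable.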

Evaluation Highlights

Review of GRPO-based methods (e.g., Video-RTS) showing that purely RL-driven models trained on only ~6k video-QA triples can match systems that use ~165k SFT pairs
Analysis of Test-Time Scaling methods like CoT-Vid showing performance gains saturate after approximately 5 reasoning samples during self-consistency voting
Documentation of large-scale gains from multi-stage pipelines: LongVILA-R1 uses ~36k SFT samples followed by ~68k RL prompts to stabilize long-video reasoning

Breakthrough Assessment

9/10

This is the first comprehensive survey to formalize the 'post-training' phase for Video-LMMs, specifically connecting recent LLM reasoning advances (R1/GRPO) to video-specific challenges.

⚙️ Technical Details

Problem Definition

Setting: Post-training Large Multimodal Models (Video-LMMs) for complex reasoning tasks

Inputs: Video V, Text Query q

Outputs: Reasoning trajectory τ (interleaved thoughts/decisions) and final answer y

Pipeline Flow

Supervised Fine-Tuning (SFT) for Cold-Start
Reinforcement Learning (RL) with Verifiable Rewards
Test-Time Scaling (TTS) for Inference

System Modules

SFT Module

Establish reasoning formats and instruction-following behavior using CoT data

Model or implementation: Various Video-LMMs (e.g., LLaVA-Next, Video-LLaMA)

Policy Optimizer (GRPO)

Optimize reasoning policy using group-relative advantages derived from verifiable outcomes

Model or implementation: SFT-initialized Video-LMM
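The group-relative advantage used by the policy optimizer can be sketched as follows. This assumes the commonly used mean-and-standard-deviation normalization within a group; the summary above only states that advantages are group-relative, so the exact normalization and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory's reward against its own sampling group.

    Assumed form: A(k) = (r_k - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the K samples
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one video question: two correct, two wrong.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct trajectories receive positive advantages and incorrect ones negative, and a group with identical rewards yields all zeros, so no critic network is needed to provide a baseline.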

Inference Engine (TTS)

Enhance reliability during inference via sampling, voting, and self-correction

Model or implementation: RL-aligned Video-LMM
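The sampling-and-voting part of test-time scaling can be illustrated with a small self-consistency sketch. It assumes the final answers have already been extracted from each sampled reasoning path as short strings (for example, multiple-choice letters); the function name and the normalization step are illustrative choices.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the majority answer and its vote share across sampled paths."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    # Light normalization so 'b ' and 'B' count as the same answer (assumption).
    counts = Counter(a.strip().upper() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Five sampled reasoning paths for the same video question.
print(self_consistency_vote(["B", "A", "B", "C", "b "]))
```

The vote share can double as a rough confidence signal, for instance to decide whether further sampling is worthwhile.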

Novel Architectural Elements

Integration of verifiable video-specific rewards (Temporal IoU, Spatial Consistency) directly into the GRPO framework
Hierarchical post-training pipeline explicitly separating 'Watch, Think, Locate, Answer' stages

Modeling

Base Model: Various (e.g., Qwen2-VL, LLaVA-Next, LLaMA-3-V)

Training Method: Multi-stage pipeline: CoT-SFT followed by GRPO/DPO

Objective Functions:

Purpose: Raise the likelihood of sampled reasoning traces and answers whose reward exceeds the group average, with a KL penalty toward the reference policy (GRPO).

Formally: L_GRPO(θ) = -(1/K) * Σ_k [A(k) * Σ_t log π_θ(y_t | x, y_<t)] + β * KL(π_θ || π_ref)

Purpose: Regress normalized advantages to stabilize training (Reg-GRPO).

Formally: L_Reg-GRPO(θ) = (1/K) * Σ (s_θ(τ, x) - A_norm)^2 + β * KL(...)

Purpose: Optimize direct preferences (DPO).

Formally: L_DPO(θ) = -E[log σ(β * [log(π_θ(y+|x)/π_ref(y+|x)) - log(π_θ(y-|x)/π_ref(y-|x))])]
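The GRPO and DPO objectives above can be evaluated numerically with a few lines of plain Python. This is a sketch under stated assumptions: each trajectory's log-probability is already summed over its tokens, the KL term is supplied as a precomputed scalar, and the function and argument names are hypothetical.

```python
import math


def grpo_loss(advantages, seq_logps, kl, beta):
    """-(1/K) * sum_k A(k) * log pi_theta(y^k | x) + beta * KL.

    seq_logps[k] is the token-summed log-probability of trajectory k.
    """
    k = len(advantages)
    policy_term = -sum(a * lp for a, lp in zip(advantages, seq_logps)) / k
    return policy_term + beta * kl


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta):
    """-log sigmoid(beta * [(log-ratio of y+) - (log-ratio of y-)])."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log sigmoid(m) = log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(grpo_loss([1.0, -1.0], [-2.0, -3.0], kl=0.0, beta=0.1))
print(dpo_loss(-1.0, -4.0, -2.0, -2.0, beta=0.1))
```

When the policy and reference agree on both responses, the DPO margin is zero and the loss equals log 2; it falls below that as the policy favors the preferred response more strongly than the reference does.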

Adaptation: LoRA or Full Fine-tuning (depending on specific method reviewed)
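To give a sense of the trade-off between LoRA and full fine-tuning, the sketch below counts trainable parameters for a single weight matrix. It relies on the standard LoRA formulation, in which a frozen d_out x d_in weight is augmented by two trainable low-rank factors of shapes d_out x r and r x d_in; the layer size and rank are arbitrary example values, not settings reported by the survey.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for one linear layer: full fine-tuning vs. LoRA."""
    full = d_in * d_out            # every entry of W is updated
    lora = rank * (d_in + d_out)   # only the factors B (d_out x r) and A (r x d_in)
    return full, lora


# Example values (assumed): a 4096 x 4096 projection with rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, lora / full)
```

For this example layer, LoRA trains fewer than 1% of the parameters that full fine-tuning would update.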

Training Data:

CoT-SFT: VideoRFT-CoT-102K, MTVR-CoT-72k
RL: Temporal-RLT-490k/32k, Video-R1-260k, MTVR-RL-110k

Key Hyperparameters:

group_size_K: Number of trajectories K sampled per prompt to form a GRPO group
beta_kl: KL penalty coefficient (fixed or dynamically adjusted, depending on the method)
reward_weights: Task-specific weights λ_m for aggregating different reward components (accuracy, format, IoU)
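The reward_weights entry amounts to a weighted sum over named reward components, R = Σ_m λ_m * r_m. The sketch below illustrates this; the component names follow the list above, while the weight values and the function name are purely hypothetical.

```python
def aggregate_reward(components, weights):
    """Weighted sum R = sum_m lambda_m * r_m over named reward components."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weights[m] * components[m] for m in weights)


# Hypothetical weights; real values are task-specific and set per method.
weights = {"accuracy": 1.0, "format": 0.2, "iou": 0.5}
components = {"accuracy": 1.0, "format": 1.0, "iou": 0.5}
print(aggregate_reward(components, weights))
```

Keeping the components separate until aggregation makes it easier to log each one and spot reward hacking on a single term.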

Compute: Not reported in the paper

Comparison to Prior Work

vs. Video-LLaMA: Shifts focus from simple alignment to complex reasoning via CoT and RL [Video-LLaMA cited in paper]
vs. Standard SFT: Demonstrates that RL (GRPO) achieves better generalization and data efficiency than SFT alone [Standard SFT cited in paper]
vs. PPO-based RLHF: Emphasizes GRPO and DPO, which avoid PPO's separate critic (GRPO) or an explicit reward model (DPO) and thereby reduce training instability [PPO cited in paper]

Limitations

Dependency on high-quality CoT data for the cold-start phase
Reward hacking risks where models optimize verifiable metrics (e.g., IoU) without true understanding
High computational cost of test-time scaling strategies like multi-path sampling
Challenges in designing robust rewards for open-ended generation tasks

Reproducibility

Code: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

The paper is a survey, but it provides a curated list of datasets (VideoRFT, MTVR) and codebases for the methods discussed. The authors maintain the GitHub repository listed above for ongoing updates.

📊 Experiments & Results

Evaluation Setup

Survey aggregates results from multiple benchmarks assessing video QA, temporal localization, and reasoning.

Benchmarks:

VideoMME (Comprehensive video understanding)
MVBench (Multi-task video benchmark)
LongVideoBench (Long-context video understanding)
Charades-STA (Temporal grounding)

Metrics:

Accuracy
Temporal IoU (tIoU)
Recall@K
Format compliance rate

Statistical methodology: Not explicitly reported in the paper
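Two of the listed metrics can be sketched in a few lines. The block assumes a `<think>...</think><answer>...</answer>` output template for format compliance and reads Recall@K in its common temporal-grounding form (Recall@1 at a tIoU threshold); both are assumptions, since the summary does not specify either, and the function names are hypothetical.

```python
import re

# Assumed output template; actual templates vary across the surveyed methods.
TEMPLATE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL
)


def format_compliance_rate(outputs):
    """Fraction of model outputs that match the assumed reasoning template."""
    if not outputs:
        return 0.0
    return sum(bool(TEMPLATE.match(o)) for o in outputs) / len(outputs)


def recall_at_iou(top1_ious, threshold=0.5):
    """Fraction of queries whose top-1 segment reaches the tIoU threshold."""
    if not top1_ious:
        return 0.0
    return sum(iou >= threshold for iou in top1_ious) / len(top1_ious)


outputs = ["<think>He trips on the cable.</think><answer>B</answer>", "B"]
print(format_compliance_rate(outputs))
print(recall_at_iou([0.7, 0.4, 0.5, 0.1]))
```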

Key Results

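Comparative analysis of data efficiency in RL versus SFT approaches (values are training-set sizes, not scores).

Evaluation	Criterion	SFT-based systems (SFT pairs)	RL-only models (video-QA triples)	Δ (samples)
Video-RTS internal evaluation	Performance parity	~165,000	~6,000	-159,000

Impact of Test-Time Scaling (TTS) on reasoning reliability (values are numbers of reasoning samples).

Evaluation	Criterion	Single sample	Saturation point	Δ (samples)
CoT-Vid evaluation	Reasoning accuracy saturation	1	~5	+4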

Main Takeaways

Reinforcement Learning with verifiable rewards (GRPO) can be far more data-efficient than Supervised Fine-Tuning alone: in the Video-RTS comparison, ~6k RL video-QA triples match systems trained on ~165k SFT pairs.
Test-Time Scaling strategies like self-consistency and iterative refinement provide reliable performance boosts without model re-training.
Separating post-training into 'Cold-Start SFT' and 'Verifiable RL' is the emerging standard for stable Video-LMM training.
Video-specific rewards (temporal IoU, spatial consistency) are critical for grounding reasoning in visual evidence, which helps reduce hallucination.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Multimodal Models (LMMs) and their architecture (a visual encoder feeding a language-model decoder)
Familiarity with Reinforcement Learning from Human Feedback (RLHF) concepts
Basic knowledge of Chain-of-Thought (CoT) reasoning

Key Terms

SFT: Supervised Fine-Tuning—training a model on labeled instructions to establish basic behavior and formatting

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies using group-relative advantages from multiple sampled outputs, removing the need for a separate critic model

CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps before the final answer to improve accuracy

Test-Time Scaling (TTS): Techniques applied during inference (like generating multiple answers and voting) to improve performance without changing model weights

Verifiable Rewards: Objective outcome measures (e.g., correct answer format, temporal intersection-over-union) used in RL instead of subjective human preferences

Temporal IoU (tIoU): Temporal Intersection over Union—a metric measuring the overlap between a predicted time segment and the ground truth segment in a video

Cold-start: An initial SFT phase using high-quality data to stabilize a model before applying Reinforcement Learning

PPO: Proximal Policy Optimization—a standard RL algorithm that updates policies with clipped objectives to prevent instability

DPO: Direct Preference Optimization—an alignment method that optimizes policies directly on preference pairs without an explicit reward model

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains only small low-rank update matrices added to selected layers

Hallucination: When a model generates plausible but incorrect information not supported by the video content

MCTS: Monte Carlo Tree Search—a decision-making algorithm that explores possible future steps to find optimal reasoning paths