
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, A. Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
University of Rochester, Northwestern University, Carnegie Mellon University, University of California, Santa Barbara, Purdue University, University of California, Los Angeles, University of Oxford, Brown University, University of Virginia, Sony Group Corporation
arXiv.org (2025)
MM RL Reasoning Benchmark

📝 Paper Summary

Post-training methodologies for Video-LMMs; Video reasoning alignment
This survey systematizes post-training for Video-LMMs by integrating Chain-of-Thought SFT, verifiable Reinforcement Learning (GRPO), and Test-Time Scaling into a unified framework for advanced video reasoning.
Core Problem
Video understanding models struggle to transition from basic perception to sophisticated reasoning due to challenges in temporal localization, spatiotemporal grounding, and the lack of a unified post-training methodology.
Why it matters:
  • Models need to reason about complex event causality and long-term dependencies, which standard pre-training does not sufficiently address
  • Current literature is fragmented between SFT, RL, and inference scaling, lacking a cohesive roadmap for building reasoning engines
  • Native visual domains lack the internet-scale self-supervised learning efficiency found in language modeling, requiring specialized post-training to bridge the gap
Concrete Example: A standard Video-LMM might correctly identify objects in a video but fail when asked 'Why did the person fall?' because it cannot ground the causal event in specific temporal segments. An RL-aligned model with temporal rewards would correctly locate the tripping event and explain the cause.
Key Novelty
Unified Video-LMM Post-Training Taxonomy
  • Organizes post-training into three pillars: SFT for reasoning formats, RL (specifically GRPO) for verifiable optimization without human preference data, and Test-Time Scaling for reliable inference
  • Defines video-specific adaptations for RL, such as using temporal IoU and spatial grounding consistency as verifiable rewards instead of learned reward models
  • Integrates 'thinking' processes into video models, enabling them to perform staged viewing, self-correction, and multi-path reasoning during inference
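To make the idea of a verifiable reward concrete, here is a minimal sketch of how a temporal IoU score could be computed for a predicted time span and then turned into group-relative advantages in the style commonly associated with GRPO. The function names, the `(start, end)` tuple format in seconds, and the `eps` constant are assumptions for illustration; they are not taken from the survey or from any specific method it reviews.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals, in seconds.

    Hypothetical helper: malformed or empty intervals score 0.0.
    """
    ps, pe = pred
    gs, ge = gt
    if pe <= ps or ge <= gs:
        return 0.0
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    This mirrors the usual group-normalized advantage; no learned reward
    model or critic is involved. `eps` is an assumed stabilizer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same grounding prompt (illustrative values).
ground_truth = (4.0, 8.0)
predictions = [(2.0, 6.0), (4.0, 8.0), (10.0, 12.0), (5.0, 7.0)]

rewards = [temporal_iou(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed directly from annotated timestamps, it can be checked without human preference labels; samples that overlap the ground-truth segment more than the group average receive positive advantages, and the rest receive negative ones.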
Evaluation Highlights
  • Review of GRPO-based methods (e.g., Video-RTS) showing that purely RL-driven models trained on only ~6k video-QA triples can match systems that use ~165k SFT pairs
  • Analysis of Test-Time Scaling methods like CoT-Vid showing performance gains saturate after approximately 5 reasoning samples during self-consistency voting
  • Documentation of large-scale gains from multi-stage pipelines: LongVILA-R1 uses ~36k SFT samples followed by ~68k RL prompts to stabilize long-video reasoning
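The self-consistency voting mentioned above can be sketched as a simple majority vote over several sampled answers. This is a generic illustration, not the CoT-Vid implementation; the `majority_vote` name and the lowercase/strip normalization are assumptions, and ties resolve to the answer seen first.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winning_answer, agreement_ratio) over sampled final answers.

    Hypothetical helper: answers are compared after trimming whitespace
    and lowercasing, which suits multiple-choice style outputs.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Five sampled reasoning paths ending in multiple-choice answers.
samples = ["B", "b ", "A", "B", "C"]
winner, agreement = majority_vote(samples)
```

With a fixed sampling budget the agreement ratio gives a rough confidence signal; the saturation reported above suggests that increasing the budget beyond roughly five samples yields little further gain.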
Breakthrough Assessment
9/10
This is the first comprehensive survey to formalize the 'post-training' phase for Video-LMMs, specifically connecting recent LLM reasoning advances (R1/GRPO) to video-specific challenges.