VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

📝 Paper Summary

Video Multimodal Large Language Models (VMLLMs) · Fine-grained video understanding

VideoPerceiver enhances the perception of brief actions and rare events in videos by training on constructed "key-information-missing" clips and using a relative reward mechanism that prioritizes detailed visual reasoning.

Core Problem

Current Video MLLMs fail to perceive fine-grained temporal events (brief actions or rare transient moments) due to uniform sampling strategies and text-centric reward designs that prioritize fluency over visual precision.

Why it matters:

Models miss critical but brief events like traffic accidents in surveillance or fleeting facial micro-expressions, rendering them unreliable for safety-critical applications
Uniform frame sampling discards short-duration visual cues, while standard holistic encoding averages out localized details needed for precise temporal reasoning

Concrete Example: In a long surveillance video, a traffic accident may last only 1-2 seconds (<1% of duration). Standard models using uniform sampling might miss these frames entirely, or generate generic captions ignoring the crash because their reward functions focus on text quality rather than visual evidence recovery.
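The arithmetic behind this example can be checked with a small sketch. The clip length (5 minutes at 30 fps), the 32-frame sampling budget, and the event position are illustrative assumptions, not values from the paper.

```python
import numpy as np

def uniform_sample_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Evenly spaced frame indices across the whole video."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def frames_hitting_event(indices, event_start: int, event_end: int) -> list:
    """Sampled indices that fall inside the half-open event span."""
    return [int(i) for i in indices if event_start <= i < event_end]

# Assumed setup: 5-minute clip at 30 fps, 32 uniformly sampled frames.
fps = 30
num_frames = fps * 300                      # 9000 frames
indices = uniform_sample_indices(num_frames, 32)

# Assumed event: a 1.5-second accident starting at t = 100 s.
event_start = int(100.0 * fps)              # frame 3000
event_end = int(101.5 * fps)                # frame 3045
event_fraction = (event_end - event_start) / num_frames

hits = frames_hitting_event(indices, event_start, event_end)
print(f"event covers {event_fraction:.2%} of the video, sampled frames inside it: {hits}")
```

With these assumed numbers the sampled frames are roughly 290 frames apart, so a 45-frame event falls between two samples and contributes no visual tokens at all.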

Key Novelty

Contrastive Missing-Information Recovery & Comparative RL

Creates synthetic training pairs where keyframes of an action are replaced by neighboring frames; the model learns to identify these missing details by contrasting the full video against the degraded version
Uses a reinforcement learning reward that explicitly compares the quality of answers generated from the full video versus the degraded video, rewarding the model only when the full video yields a better answer

Architecture

The complete training pipeline including Key-Information-Absent Video Construction and Comparative GRPO.

Evaluation Highlights

Achieves 0.61 on MotionBench Repetition Count (VideoPerceiver-7B), surpassing Qwen2.5-VL-7B (0.35) by 0.26 and marking the first score above 0.6 on this subtask
+22.9 percentage points in average accuracy on VRU-Accident VQA for VideoPerceiver-3B over Qwen2.5-VL-3B (39.4 to 62.3)
State-of-the-art performance on Dense Caption generation for VRU-Accident, outperforming baselines in BLEU, METEOR, and ROUGE metrics

Breakthrough Assessment

8/10

Significant methodology shift for video LLMs by introducing 'negative' video samples (missing keyframes) into both SFT and RL, yielding large gains on difficult fine-grained benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering and Captioning with a focus on fine-grained temporal dynamics

Inputs: Video sequence V and text prompt T

Outputs: Textual response (answer or caption)

Pipeline Flow

Visual Encoding (Shared Encoder)
Multimodal Transformer (Processing concatenated video & text tokens)
Text Generation (LLM Head)

System Modules

Visual Encoder

Encodes video frames into spatiotemporal representations

Model or implementation: Qwen2.5-VL built-in encoder

Multimodal Transformer

Jointly processes visual and textual tokens to generate contextualized embeddings

Model or implementation: Qwen2.5-VL (3B or 7B)

Quality Comparator (RL Stage)

Evaluates generated answers to compute relative rewards

Model or implementation: Task-adaptive metric (Exact match for closed-form, NLI entailment for open-ended)
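
A task-adaptive comparator of this kind might be sketched as follows. The function names, the normalization rule, and the `nli_scorer` callable are hypothetical; the summary does not specify which NLI model is used, so a toy token-overlap scorer stands in for it here and is not a real entailment model.

```python
from typing import Callable, Optional

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())

def quality_metric(
    answer: str,
    reference: str,
    closed_form: bool,
    nli_scorer: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Score an answer in [0, 1] against a reference.

    Closed-form questions use exact match; open-ended ones defer to an
    entailment scorer supplied by the caller (hypothetical interface:
    nli_scorer(premise, hypothesis) -> probability of entailment).
    """
    if closed_form:
        return 1.0 if normalize(answer) == normalize(reference) else 0.0
    if nli_scorer is None:
        raise ValueError("open-ended scoring needs an nli_scorer")
    return float(nli_scorer(reference, answer))

def toy_overlap_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: share of hypothesis tokens found in the premise."""
    premise_tokens = set(normalize(premise).split())
    hypothesis_tokens = normalize(hypothesis).split()
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)

closed_score = quality_metric(" B ", "b", closed_form=True)
open_score = quality_metric(
    "car hits pedestrian", "a car hits a pedestrian",
    closed_form=False, nli_scorer=toy_overlap_scorer,
)
```

Keeping the scorer as an injected callable lets the same reward code serve both question types without committing to a particular entailment model.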

Novel Architectural Elements

Dual-stream input processing during training: Parallel processing of original video and 'key-information-absent' video to compute contrastive losses and relative rewards
Intermediate Layer Contrastive Learning injection: Applying contrastive objectives specifically at the 16th LLM block to align intermediate representations
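
As a rough illustration of the contrastive objective, the NumPy sketch below computes an InfoNCE loss over pooled embeddings, optionally adding one explicit hard negative per anchor (for instance, the embedding of the key-information-absent video). The pooling, the temperature of 0.07, and the way hard negatives enter the logits are assumptions; the paper's exact formulation at the 16th LLM block may differ.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(anchors, positives, hard_negatives=None, temperature=0.07) -> float:
    """InfoNCE with in-batch negatives and optional per-anchor hard negatives."""
    a = l2_normalize(np.asarray(anchors, dtype=float))
    p = l2_normalize(np.asarray(positives, dtype=float))
    logits = a @ p.T                                   # (B, B); diagonal holds the positives
    if hard_negatives is not None:
        n = l2_normalize(np.asarray(hard_negatives, dtype=float))
        hard = np.sum(a * n, axis=1, keepdims=True)    # (B, 1) similarity to own hard negative
        logits = np.concatenate([logits, hard], axis=1)
    logits = logits / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    batch = np.arange(a.shape[0])
    return float(-log_probs[batch, batch].mean())

# Synthetic embeddings standing in for pooled intermediate-layer features.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
text = video + 0.05 * rng.normal(size=(8, 16))        # text closely aligned with its video
key_absent = video + 0.5 * rng.normal(size=(8, 16))   # degraded video, still similar

loss_in_batch = info_nce(text, video)
loss_with_hard = info_nce(text, video, hard_negatives=key_absent)
```

Because the key-absent embedding is close to the original, it acts as a much harder negative than other videos in the batch, which is what pushes the representation to encode the missing keyframes.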

Modeling

Base Model: Qwen2.5-VL (3B and 7B variants)

Training Method: Two-stage: Supervised Fine-Tuning (SFT) followed by Comparative GRPO (RL)

Objective Functions:

Purpose: SFT - Standard language modeling loss.

Formally: Cross-entropy on next-token prediction.

Purpose: SFT - Sharpen intra-video discriminability.

Formally: Video-to-Video InfoNCE loss between original video tokens and perturbed tokens (via dropout).

Purpose: SFT - Align text keywords with visual cues.

Formally: Text-to-Video and Video-to-Text InfoNCE losses using key-absent videos as negatives.

Purpose: RL - Incentivize better answers from full videos compared to degraded ones.

Formally: R_comp = M(o, o_ref) - M(o_hat, o_ref), where M is a quality metric, o is the response generated from the full video, o_hat is the response generated from the key-information-absent video, and o_ref is the reference answer.

Purpose: RL - Total advantage for policy update.

Formally: A_total = A_base + lambda * A_comp, where lambda weights the comparative term.
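
The two RL formulas can be combined in a short sketch. It assumes that both advantage terms are obtained by GRPO-style normalization within a group of sampled responses and uses lambda = 0.5; neither detail is stated in this summary, and all function names are hypothetical.

```python
import numpy as np

def group_normalize(rewards, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within one group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def comparative_reward(m_full, m_degraded) -> np.ndarray:
    """R_comp = M(o, o_ref) - M(o_hat, o_ref), computed per rollout."""
    return np.asarray(m_full, dtype=float) - np.asarray(m_degraded, dtype=float)

def total_advantage(base_rewards, m_full, m_degraded, lam: float = 0.5) -> np.ndarray:
    """A_total = A_base + lambda * A_comp (normalizing A_comp is an assumption)."""
    a_base = group_normalize(base_rewards)
    a_comp = group_normalize(comparative_reward(m_full, m_degraded))
    return a_base + lam * a_comp

# One group of four rollouts with made-up metric scores.
m_full = [1.0, 1.0, 0.0, 0.0]        # quality of answers from the full video
m_degraded = [0.0, 1.0, 0.0, 1.0]    # quality of answers from the key-absent video
base_rewards = [1.0, 1.0, 0.0, 0.0]  # assumed base reward (e.g. correctness)

r_comp = comparative_reward(m_full, m_degraded)
adv = total_advantage(base_rewards, m_full, m_degraded)
```

In this toy group the first rollout is correct only when the keyframes are present, so it receives the largest advantage, while the second rollout, which is correct with or without them, gets no comparative bonus.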

Training Data:

VideoPerceiver-80K dataset: 80k clips from HMDB51, CelebV-HQ, MM-AU
Annotations generated via GPT-4/GPT-4o (captions + QA pairs)
Key-information-absent videos: Keyframes identified via BLIP-2 similarity to keywords and replaced with preceding frames
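
The key-information-absent construction can be illustrated with a minimal sketch. It assumes per-frame similarity scores to the keywords have already been computed (by BLIP-2 in the paper; here they are made-up numbers), and the `top_k` selection rule and the fallback for a keyframe at position 0 are assumptions.

```python
import numpy as np

def build_key_absent_video(frames, similarity, top_k: int = 2):
    """Replace the top_k most keyword-similar frames with the preceding frame.

    frames: array of shape (T, ...); similarity: length-T scores.
    Returns the degraded frames and the sorted keyframe indices.
    """
    frames = np.asarray(frames)
    similarity = np.asarray(similarity, dtype=float)
    if len(frames) != len(similarity):
        raise ValueError("need one similarity score per frame")
    if not 0 < top_k < len(frames):
        raise ValueError("top_k must leave at least one frame untouched")

    key_idx = np.sort(np.argsort(similarity)[-top_k:])
    is_key = np.zeros(len(frames), dtype=bool)
    is_key[key_idx] = True
    first_clean = int(np.argmin(is_key))   # first non-key frame, used if frame 0 is a keyframe

    degraded = frames.copy()
    for i in range(len(frames)):
        if is_key[i]:
            # degraded[i - 1] already holds the nearest preceding non-key frame
            degraded[i] = degraded[i - 1] if i > 0 else frames[first_clean]
    return degraded, key_idx.tolist()

# Toy "video": frame ids 0..7, with frames 3 and 4 matching the keywords best.
frames = np.arange(8)
similarity = [0.1, 0.1, 0.2, 0.9, 0.8, 0.2, 0.1, 0.1]
degraded, key_idx = build_key_absent_video(frames, similarity, top_k=2)
print(degraded.tolist(), key_idx)
```

The degraded clip keeps its length and overall context, so any difference in the model's answer between the two clips can be attributed to the removed keyframes.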

Key Hyperparameters:

sft_learning_rate: 2e-4
contrastive_learning_rate: 1e-4
contrastive_weight: 0.1
sft_batch_size: 16
rl_learning_rate: 2e-6
rl_batch_size: 8

Compute: SFT: ~50 GPU hours on 8x A100. RL: ~600 GPU hours on 8x A100.

Comparison to Prior Work

vs. Video-R1: VideoPerceiver uses a *comparative* reward (Full vs. Degraded video) rather than just a temporal/spatial reward
vs. Qwen2.5-VL: VideoPerceiver adds specific fine-tuning and RL for transient events, significantly boosting performance on fine-grained tasks
vs. Standard VMLLMs (LLaVA-Video etc.): VideoPerceiver actively constructs negative video samples (missing keyframes) during training to force the model to learn fine-grained dependencies

Limitations

High computational cost for the RL stage (~600 GPU hours vs. ~50 for SFT)
Reliance on powerful proprietary LLMs (GPT-4) for data curation and annotation
Performance on non-fine-grained tasks is maintained but not significantly improved compared to base Qwen2.5-VL

Reproducibility

The VideoPerceiver-80K dataset is curated from public sources (HMDB51, CelebV-HQ, MM-AU), and the Qwen2.5-VL base models are public. Code availability is not stated in the text. The GPT-4 prompt templates used for data generation are mentioned but not fully detailed.

📊 Experiments & Results

Evaluation Setup

Benchmarking on specific fine-grained action and transient event datasets, plus general video understanding benchmarks.

Benchmarks:

MotionBench (fine-grained action understanding; 6 subtasks including recognition and counting)
VRU-Accident (transient event perception; VQA and Dense Captioning)
VideoPerceiver-80K (training dataset; not used directly as an evaluation benchmark in the results tables) [New]
MVBench (General video understanding)
VideoMME (General video understanding)

Metrics:

Accuracy (MotionBench, VRU-Accident VQA, MVBench, VideoMME, VSIBench, VideoMMMU)
BLEU-4, METEOR, ROUGE-L, CIDEr (VRU-Accident Captioning)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on MotionBench shows state-of-the-art results in fine-grained action tasks, particularly in repetition counting.
MotionBench	Repetition Count Score	0.35	0.61	+0.26
MotionBench	Average Score	57.8	69.7	+11.9
Results on VRU-Accident benchmark demonstrate superior capability in perceiving transient events like traffic accidents.
VRU-Accident (VQA)	Average Accuracy	39.4	62.3	+22.9
VRU-Accident (VQA)	Average Accuracy	48.2	66.0	+17.8
VRU-Accident (Captioning)	CIDEr	20.1	35.2	+15.1
General video understanding benchmarks show that specialized training does not degrade general capabilities.
MVBench	Accuracy	67.4	69.1	+1.7
VideoMME	Accuracy	64.5	65.3	+0.8

Experiment Figures

Conceptual motivation showing how standard models fail on transient events (e.g., traffic accident) while VideoPerceiver succeeds.

Main Takeaways

Substantial improvements on fine-grained action tasks (MotionBench) and transient event tasks (VRU-Accident) validate the method's focus on detailed temporal perception.
The method generalizes well, maintaining or slightly improving performance on general benchmarks (MVBench, VideoMME) rather than overfitting to fine-grained tasks.
Reinforcement learning with a comparative reward (Original vs. Degraded) is highly effective for forcing the model to attend to visual evidence rather than hallucinating based on text priors.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Reinforcement Learning (RL) with textual rewards
Contrastive Learning (InfoNCE loss)
Transformer architecture basics

Key Terms

VMLLM: Video Multimodal Large Language Model—AI systems that process both video and text inputs to generate text

SFT: Supervised Fine-Tuning—training a model on labeled examples to adapt it to specific tasks

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that updates policies based on the relative performance of a group of generated outputs

InfoNCE: A contrastive loss function used to maximize agreement between positive pairs (e.g., related text and video) while minimizing agreement with negative pairs

transient event: A very short duration event (e.g., 1-2 seconds) embedded within a much longer video, often carrying critical semantic meaning

Q-Former: A module that acts as a bridge between visual encoders and language models, compressing visual features into a fixed number of tokens

BLIP-2: A vision-language model architecture used here to compute semantic similarity between frames and text keywords for keyframe selection