TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

📝 Paper Summary

Video Large Language Models, Temporal Reasoning, Model Alignment

TEMPLE improves Video LLMs' temporal reasoning by generating synthetic preference pairs via video perturbation and applying Direct Preference Optimization before standard supervised fine-tuning.

Core Problem

Current Video LLMs struggle with temporal reasoning because standard next-token prediction on static datasets fails to enforce dynamic understanding, leading to reliance on visual shortcuts.

Why it matters:

Models frequently hallucinate or overlook events in videos, leading to unreliable responses for dynamic content
Existing datasets lack strong temporal correspondence, making it hard for models to learn event sequencing
Standard post-SFT alignment assumes basic capabilities are already learned, but temporal understanding is often missing from the base model

Concrete Example: In a video analysis of an archery clip, a standard model (Qwen2-VL) hallucinates that the archer 'releases the arrow' when the video only shows the preparation phase, failing to recognize the specific temporal segment provided.

Key Novelty

Progressive Pre-SFT Alignment with Synthetic Temporal Preferences

Constructs a self-supervised preference dataset by comparing the model's responses to clean videos (chosen) with its responses to temporally perturbed videos (rejected), such as reversed or shuffled clips
Reverses the standard training order by applying DPO *before* instruction tuning (SFT) to establish fundamental temporal alignment first
Uses a curriculum learning strategy that gradually increases the difficulty of perturbations during training to improve data efficiency

Architecture

The automated data construction pipeline for TEMPLE.

Evaluation Highlights

+3.4-point improvement on the Video-MME temporal dimension using Qwen2-VL-7B compared to standard SFT baselines
Consistently outperforms standard SFT-then-DPO approaches across MLVU and Vinoground benchmarks with a relatively small set of self-generated data
Demonstrates high transferability across different model architectures (LLaVA-Video, Kangaroo) and scales (7B, 8B)

Breakthrough Assessment

7/10

Offers a clever, scalable data-generation pipeline and challenges the standard SFT-then-DPO paradigm. The improvements are consistent but incremental, and the method does not introduce a transformative architectural change.

⚙️ Technical Details

Problem Definition

Setting: Video-to-Text Generation with a focus on temporal alignment

Inputs: Video frames V and text instruction I

Outputs: Textual response R describing the video content or answering the instruction

Pipeline Flow

Video Filtering & Selection
Preference Pair Generation
Progressive Pre-SFT Alignment Training

System Modules

Video Filter (Data Pipeline)

Select temporality-rich videos by detecting scene boundaries and filtering out static or overly repetitive content

Model or implementation: TransNetV2 (scene detection) + SigLIP (similarity grouping)

Perturbation Engine (Data Pipeline)

Create 'rejected' inputs by modifying video temporal structure

Model or implementation: Heuristic algorithms
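
The summary names reversing and shuffling as the temporal perturbations but does not spell out the heuristics. The following is a minimal sketch, assuming a video is represented as a sequence of frames; the function names and the `num_segments` knob are illustrative and not taken from the paper.

```python
import random
from typing import List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def reverse_clip(frames: Sequence[Frame]) -> List[Frame]:
    """Return the frames in reverse temporal order."""
    return list(reversed(frames))


def shuffle_segments(frames: Sequence[Frame], num_segments: int,
                     seed: Optional[int] = None) -> List[Frame]:
    """Split the clip into contiguous segments and permute their order.

    Frames inside a segment keep their original order, so only the
    coarse event sequence is corrupted (an assumption of this sketch).
    """
    if num_segments < 2 or len(frames) < num_segments:
        raise ValueError("need >= 2 segments and at least one frame per segment")
    rng = random.Random(seed)
    n = len(frames)
    # Integer boundaries that spread frames as evenly as possible.
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    segments = [list(frames[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]
    identity = list(range(num_segments))
    order = identity[:]
    # Re-draw until the segment order actually changes.
    while order == identity:
        rng.shuffle(order)
    return [frame for idx in order for frame in segments[idx]]
```

For example, `shuffle_segments(list(range(8)), num_segments=4, seed=0)` returns the same eight frames with the four two-frame segments in a different order.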

Response Generator (Data Pipeline)

Generate captions for clean and perturbed videos to form preference pairs

Model or implementation: Target Video LLM (e.g., Qwen2-VL)
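
To make the pairing concrete, here is a small sketch of how one preference record could be assembled. It assumes a hypothetical `caption_fn(frames, prompt)` callable standing in for the target Video LLM, and it assumes the DPO input pairs both captions with the clean video; neither detail is confirmed by the summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    video: List[str]   # clean frames (frame identifiers in this toy example)
    prompt: str
    chosen: str        # caption generated from the clean video
    rejected: str      # caption generated from the perturbed video


def build_pair(frames: List[str], prompt: str,
               caption_fn: Callable[[List[str], str], str],
               perturb_fn: Callable[[List[str]], List[str]]) -> PreferencePair:
    """Caption the clean and perturbed clip with the same model and prompt."""
    clean = list(frames)
    perturbed = perturb_fn(clean)
    return PreferencePair(
        video=clean,
        prompt=prompt,
        chosen=caption_fn(clean, prompt),
        rejected=caption_fn(perturbed, prompt),
    )


# Toy stand-ins: a "model" that narrates frame labels in order, and reversal.
toy_caption = lambda frames, prompt: " then ".join(frames)
pair = build_pair(["draw", "aim", "release"], "Describe the video.",
                  toy_caption, lambda f: f[::-1])
```

In the toy run, the chosen caption follows the true event order while the rejected caption narrates the reversed order, which is the kind of contrast the preference data is meant to expose.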

Alignment Trainer

Optimize the model using DPO with curriculum learning

Model or implementation: Video LLM (Qwen2-VL, LLaVA-Video, etc.)
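
The summary states that perturbation difficulty increases gradually during training but does not give the schedule. As one possible illustration, the sketch below ramps a difficulty value linearly with training progress; the linear shape, the `[0, 1]` range, and the name `difficulty_at` are assumptions of this example, not the authors' schedule.

```python
def difficulty_at(step: int, total_steps: int,
                  r_start: float = 0.0, r_end: float = 1.0) -> float:
    """Linearly interpolate a difficulty value from r_start to r_end.

    step is clamped to [0, total_steps - 1], so out-of-range steps map
    to the nearest endpoint.
    """
    if total_steps <= 1:
        return r_end
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    return r_start + (r_end - r_start) * progress
```

A training loop could call `difficulty_at(step, total_steps)` to decide which difficulty bucket of preference pairs to sample from at each step.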

Novel Architectural Elements

Pre-SFT Alignment pipeline topology: The DPO stage is placed before the SFT stage in the training workflow

Modeling

Base Model: Qwen2-VL-7B (primary), also tested on LLaVA-Video-7B and Kangaroo-8B

Training Method: Direct Preference Optimization (DPO) followed by Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Optimize policy to assign higher probability to chosen (clean video) responses over rejected (perturbed video) responses.

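Formally: the DPO loss $\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$, where $y_w$ is the chosen response, $y_l$ is the rejected response, and $\pi_{\mathrm{ref}}$ is the reference policy.

Purpose: Standard instruction tuning loss.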

Formally: Cross-entropy loss on next-token prediction
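
The DPO objective above can be checked numerically for a single preference pair. This is a minimal sketch in plain Python, assuming the four inputs are the summed token log-probabilities of each response under the policy and the reference model; the function name `dpo_loss` is illustrative, and the default `beta=0.1` matches the value listed under Key Hyperparameters.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(margin)).

    margin = beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                     - (log pi(y_l|x) - log pi_ref(y_l|x))]
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls below log 2 as the policy raises the chosen response's log-ratio relative to the rejected one.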

Training Data:

13k videos selected from Panda-70M
Filtered down from larger set using TransNetV2 and SigLIP similarity metrics
Preference pairs generated via self-prompting (no external API)

Key Hyperparameters:

learning_rate: 5e-7 (DPO), 2e-5 (SFT)
batch_size: 64 (DPO), 128 (SFT)
epochs: 1 (DPO), 1 (SFT)
beta: 0.1 (DPO KL penalty)
max_length: 2048
schedule: Cosine decay
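
For readers unfamiliar with the listed schedule, the sketch below shows a plain cosine decay. It assumes no warmup and a floor of zero, neither of which the summary specifies; `cosine_lr` is an illustrative name.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the DPO learning rate listed above, `cosine_lr(0, 100, 5e-7)` starts at the peak value and the rate reaches the floor at the final step.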

Compute: 8x NVIDIA H800 GPUs

Comparison to Prior Work

vs. POVID: Extends the visual perturbation concept to the temporal domain (shuffling, reversing) rather than just noise/resolution
vs. Tarsier2: Uses a self-sufficient pipeline without proprietary LLMs and applies DPO *before* SFT
vs. Standard SFT: Introduces explicit negative signals for temporal errors via DPO
vs. SimPO [not cited in paper]: Unlike SimPO, which modifies the DPO objective to be reference-free, TEMPLE keeps the standard DPO objective and focuses on data construction and the training stage (Pre-SFT).

Limitations

Reliance on the model's own capabilities to generate captions means the 'chosen' response quality is bounded by the base model
Focuses primarily on captioning tasks for preference learning, which might not cover all reasoning types
Preliminary analysis limited to a small sample (19 videos) for manual annotation

Reproducibility

Code: https://github.com/lscpku/TEMPLE

The paper describes the filtering thresholds (e.g., scene length 0.2s-16s, 4-32 groups) and hyperparameters in detail. The data source, Panda-70M, is public.
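
The quoted thresholds can be read as two simple range checks. The sketch below assumes "scene length" means the duration in seconds of a detected scene and "groups" means the number of similarity groups found in a video; how the authors combine the two checks is not stated here, so they are kept as separate, hypothetical predicates.

```python
def scene_length_ok(start_s: float, end_s: float,
                    min_len_s: float = 0.2, max_len_s: float = 16.0) -> bool:
    """True if the scene duration lies within [min_len_s, max_len_s] seconds."""
    return min_len_s <= (end_s - start_s) <= max_len_s


def group_count_ok(num_groups: int,
                   min_groups: int = 4, max_groups: int = 32) -> bool:
    """True if the number of similarity groups lies within [min_groups, max_groups]."""
    return min_groups <= num_groups <= max_groups
```

Scene boundaries would come from a shot detector such as TransNetV2 and group counts from clustering frame embeddings such as SigLIP features; both steps are outside the scope of this sketch.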

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on standard video understanding benchmarks

Benchmarks:

Video-MME (comprehensive video understanding; short, medium, and long videos)
MLVU (Multi-task video understanding)
Vinoground (Temporal visual reasoning)

Metrics:

Accuracy (%)
Score

Statistical methodology: Not explicitly reported in the paper

Key Results

Main results on Video-MME showing consistent improvement over the SFT baseline across video lengths.

Benchmark	Metric	Baseline	This Paper	Δ
Video-MME	Overall Score	52.7	54.1	+1.4
Video-MME	Temporal Dimension Score	46.6	50.0	+3.4

Performance on MLVU benchmark demonstrating generalization.

Benchmark	Metric	Baseline	This Paper	Δ
MLVU	Overall Score	59.2	61.3	+2.1
Video-MME	Overall Score	53.4	54.1	+0.7

Experiment Figures

Training loss curves comparing Pre-SFT Alignment vs. Standard SFT.

Illustration of the difficulty factor 'r' in perturbations.

Main Takeaways

Pre-SFT DPO consistently outperforms the traditional SFT-then-DPO pipeline for temporal alignment.
The method generalizes well across different base models (Qwen2-VL, LLaVA-Video, Kangaroo) without architecture-specific tuning.
Curriculum learning (progressive difficulty) provides a measurable boost over static difficulty training.
Temporal perturbations (shuffling, reversing) are effective in creating hard negatives that force the model to learn event dynamics.

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Supervised Fine-Tuning (SFT)
Video Large Language Models structure (Visual Encoder + LLM)

Key Terms

DPO: Direct Preference Optimization—an alignment method that optimizes a model to prefer 'chosen' responses over 'rejected' ones without a separate reward model

SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs to learn how to follow user commands

Pre-SFT Alignment: The authors' proposed strategy of running DPO *before* SFT, contrary to the standard practice of running it afterwards

Temporal Perturbation: Deliberately corrupting video inputs (e.g., reversing clips, shuffling order) to create 'rejected' examples where the model's output is likely incorrect or confused

Curriculum Learning: A training strategy where the difficulty of tasks (in this case, the subtlety of perturbations) increases over time

SigLIP: A vision-language model used here to compute similarity between video frames for filtering redundant content

TransNetV2: A deep learning model specifically designed for detecting shot boundaries and scene transitions in videos