Factorized Learning for Temporally Grounded Video-Language Models

📝 Paper Summary

Video-Language Models (Video LLMs), Temporal Grounding

D2VLM decouples video understanding into pure grounding followed by evidence-referenced answering, supported by special tokens that capture visual semantics and a factorized preference optimization algorithm.

Core Problem

Existing video LLMs handle temporal grounding and textual response in a coupled manner without clear logical structure, leading to sub-optimal objectives where grounding tokens focus only on timestamps rather than visual semantics.

Why it matters:

Coupled learning objectives confuse the model, leading to inaccurate temporal localization and hallucinations in textual answers.
Current special tokens for grounding only represent timestamps, missing the rich visual context of the event needed for the subsequent textual answer.
Lack of explicit preference optimization for temporal grounding limits the model's ability to align with human intent on localization tasks.

Concrete Example: In a video query asking 'What did I put in the rack?', a standard model might output a timestamp '[23.5s - 46.1s]' but incorrectly describe the object as a 'Large basket' because the timestamp token didn't capture the specific visual event of placing a 'Small bag'.

Key Novelty

D2VLM (Decoupled & Dependent Video-Language Model)

Decomposes generation into two stages: first 'pure grounding' to find evidence, then 'interleaved text-evidence answering' that references that evidence.
Introduces a visual-semantic '<evi>' token that aggregates features from relevant video frames, explicitly capturing event content rather than just time boundaries.
Proposes Factorized Preference Optimization (FPO) to optimize both textual quality and probabilistic grounding accuracy using a synthetic factorized dataset.

Architecture

The D2VLM framework showing the two-stage generation process and the evidence token mechanism.

Evaluation Highlights

+21.6 points average F1 on E.T. Bench Grounding compared to E.T.Chat-3.8B (38.6% → 60.2%).
+4.4 points on Charades-STA R@1 (IoU=0.5) compared to E.T.Chat-3.8B (45.9% → 50.3%).
Outperforms larger 7B/13B models (e.g., LITA-13B, TimeChat-7B) using a smaller 3.8B parameter model across grounding and captioning benchmarks.

Breakthrough Assessment

8/10

Large gains on grounding benchmarks (more than 20 F1 points on E.T. Bench Grounding) with a smaller model. The factorized preference optimization for grounding is a novel and methodologically sound contribution to aligning multimodal models.

⚙️ Technical Details

Problem Definition

Setting: Temporally grounded video question answering

Inputs: Video V and textual question Q

Outputs: Textual response R interleaved with temporal grounding information (intervals)

Pipeline Flow

Video Encoding (ViT + Q-Former)
LLM Decoder (Stage 1: Pure Grounding)
Visual Semantic Aggregation (Update <evi> tokens)
LLM Decoder (Stage 2: Interleaved Answer Generation)
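
The ordering of the steps above can be sketched as a small driver function. This is only a toy illustration of the control flow: `run_two_stage`, `toy_ground`, `toy_aggregate`, and `toy_answer` are hypothetical names, the video encoder is assumed to have already produced per-frame features, and the "grounding" here returns frame-index intervals rather than the similarity-based selection the model actually uses.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]         # toy stand-in for one frame's visual tokens
Interval = Tuple[int, int]  # inclusive frame-index range of one grounded event


def run_two_stage(
    frames: Sequence[Frame],
    question: str,
    ground: Callable[[Sequence[Frame], str], List[Interval]],
    aggregate: Callable[[Sequence[Frame], Interval], Frame],
    answer: Callable[[str, List[Frame]], str],
) -> Tuple[List[Interval], str]:
    """Hypothetical driver mirroring the pipeline order (not the authors' code)."""
    # Stage 1: pure grounding, one interval per emitted <evi> token.
    intervals = ground(frames, question)
    # Visual semantic aggregation: fill each <evi> token with frame content.
    evidence = [aggregate(frames, iv) for iv in intervals]
    # Stage 2: answer generation that references the updated evidence.
    return intervals, answer(question, evidence)


def toy_ground(frames, question):
    # Assumption for the demo: frames whose first feature exceeds 0.5 are relevant.
    hits = [i for i, f in enumerate(frames) if f[0] > 0.5]
    return [(hits[0], hits[-1])] if hits else []


def toy_aggregate(frames, interval):
    # Average pooling over the frames inside the interval.
    start, end = interval
    span = frames[start:end + 1]
    return [sum(col) / len(span) for col in zip(*span)]


def toy_answer(question, evidence):
    return f"answer built from {len(evidence)} evidence token(s)"


frames = [[0.0, 0.0], [1.0, 2.0], [0.75, 4.0], [0.25, 0.0]]
intervals, text = run_two_stage(
    frames, "What did I put in the rack?", toy_ground, toy_aggregate, toy_answer
)
print(intervals, text)
```

The point of the sketch is that the answering stage never sees raw timestamps alone: it receives evidence vectors that were filled in after grounding finished.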

System Modules

Video Encoder

Encode raw video frames into visual tokens

Model or implementation: ViT-G/14 (EVA-CLIP) + Q-Former-like compressor

LLM Decoder (Grounding Stage) (Generation)

Generate pure temporal grounding tokens (<evi>) to localize relevant events

Model or implementation: Phi-3-Mini-3.8B

Visual Aggregator

Inject visual semantics into <evi> tokens

Model or implementation: Non-parametric operation (Average Pooling)

LLM Decoder (Answering Stage) (Generation)

Generate final response referencing grounded evidence

Model or implementation: Phi-3-Mini-3.8B

Novel Architectural Elements

Two-stage generation constraint: Pure grounding phase followed by evidence-referencing phase
Evidence token mechanism: Explicit aggregation of visual features into the special token embedding based on frame-token similarity
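
A minimal sketch of the evidence token mechanism, under stated assumptions: similarity is taken as a sigmoid over a dot product, frames scoring above a 0.5 threshold count as relevant, and the pooled feature is added to the token embedding. The summary only states that aggregation is non-parametric average pooling driven by frame-token similarity, so the threshold, the additive update, and the function names are illustrative choices rather than details from the paper.

```python
import math
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def frame_scores(evi: List[float], frames: List[List[float]]) -> List[float]:
    # One similarity score in (0, 1) per frame.
    return [sigmoid(dot(evi, f)) for f in frames]


def update_evi(
    evi: List[float], frames: List[List[float]], threshold: float = 0.5
) -> Tuple[List[float], List[float]]:
    """Average-pool the features of relevant frames into the <evi> embedding."""
    scores = frame_scores(evi, frames)
    selected = [f for f, s in zip(frames, scores) if s > threshold]
    if not selected:
        # No frame passes the threshold: leave the token unchanged.
        return list(evi), scores
    pooled = [sum(col) / len(selected) for col in zip(*selected)]
    # Assumed update rule: add the pooled visual feature to the token embedding.
    return [e + p for e, p in zip(evi, pooled)], scores


evi_token = [1.0, 0.0]
frame_feats = [[-2.0, 1.0], [2.0, 4.0], [4.0, 2.0]]
updated, scores = update_evi(evi_token, frame_feats)
print(updated)
```

In this toy input the first frame scores below the threshold, so only the last two frames contribute to the pooled feature.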

Modeling

Base Model: Phi-3-Mini-3.8B

Training Method: Factorized Preference Optimization (FPO) + Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Standard next-token prediction.
Formally: Cross-entropy loss on text tokens.

Purpose: Align <evi> tokens with ground truth temporal intervals.
Formally: Binary cross-entropy between frame-level similarity scores and ground truth intervals.

Purpose: Enforce consistency between grounding stage and answering stage.
Formally: L2 distance between <evi> embeddings in Stage 1 and Stage 2.

Purpose: Factorized Preference Optimization.
Formally: DPO-style loss incorporating both text probability and probabilistic temporal grounding (product of frame similarities).
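
The objectives above can be sketched in a few lines. This is an illustration under assumptions, not the paper's exact formulation: the grounding probability is read here as a product of per-frame Bernoulli terms (score for frames inside the interval, one minus score for frames outside), the preference term uses the standard DPO form with a frozen reference model, and `beta=0.1` is a placeholder value.

```python
import math
from typing import List


def grounding_log_prob(scores: List[float], labels: List[int]) -> float:
    # Log of the product of per-frame probabilities (assumed Bernoulli form).
    return sum(math.log(s if y else 1.0 - s) for s, y in zip(scores, labels))


def grounding_bce(scores: List[float], labels: List[int]) -> float:
    # Binary cross-entropy between frame scores and interval labels.
    return -grounding_log_prob(scores, labels) / len(scores)


def consistency_l2(evi_stage1: List[float], evi_stage2: List[float]) -> float:
    # L2 distance between the <evi> embeddings of the two stages.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(evi_stage1, evi_stage2)))


def factorized_log_prob(
    text_token_log_probs: List[float], scores: List[float], labels: List[int]
) -> float:
    # Text factor plus grounding factor of one interleaved response.
    return sum(text_token_log_probs) + grounding_log_prob(scores, labels)


def dpo_style_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # hypothetical value
) -> float:
    # Standard DPO: -log sigmoid(beta * (chosen margin - rejected margin)).
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


scores = [0.9, 0.8, 0.1]
chosen = factorized_log_prob([-0.1, -0.2], scores, [1, 1, 0])    # correct interval
rejected = factorized_log_prob([-0.1, -0.2], scores, [0, 1, 1])  # shifted interval
print(dpo_style_loss(chosen, rejected, ref_chosen=-3.0, ref_rejected=-3.0))
```

Because the grounding factor enters the sequence log-probability, a rejected sample that differs only in its interval still produces a preference signal, even when its text is identical to the chosen one.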

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters and projector/Q-Former (implied from setup)
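
For readers unfamiliar with LoRA, the sketch below shows the general idea on a single linear layer: the pretrained weight `W` stays frozen and only the low-rank factors `A` and `B` are trained. The rank, the `alpha` scaling value, and which layers receive adapters are not stated in this summary, so the numbers here are placeholders.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float], alpha: float) -> List[float]:
    """y = W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    r = len(A)                        # rank = number of rows in A
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight (identity for the demo)
A = [[1.0, 1.0]]              # r x d_in, rank 1
B_init = [[0.0], [0.0]]       # d_out x r, zero at initialization
B_trained = [[2.0], [0.0]]    # hypothetical values after some training
x = [1.0, 2.0]

print(lora_forward(W, A, B_init, x, alpha=1.0))
print(lora_forward(W, A, B_trained, x, alpha=1.0))
```

With `B` initialized to zero the adapted layer reproduces the frozen layer exactly, which is why LoRA fine-tuning starts from the behavior of the base checkpoint.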

Training Data:

E.T. Instruct 164K dataset
Synthetic preference dataset constructed by perturbing sub-video events (shifting time, deleting events, distorting text)
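
The three perturbation types named above (shifting time, deleting events, distorting text) can be sketched as operations on a list of sub-video events. The `Event` structure and function names are hypothetical, and the distorted text is passed in by hand here, whereas the paper is described as using an LLM for text perturbations.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    start: float  # seconds
    end: float    # seconds
    text: str     # description tied to this sub-video event


def shift_time(event: Event, offset: float, video_len: float) -> Event:
    # Move the interval while keeping its length and staying inside the video.
    length = event.end - event.start
    start = min(max(event.start + offset, 0.0), video_len - length)
    return replace(event, start=start, end=start + length)


def make_rejected(
    events: List[Event], index: int, offset: float, wrong_text: str, video_len: float
) -> Dict[str, List[Event]]:
    """Build one dis-preferred variant per perturbation type for events[index]."""
    shifted = list(events)
    shifted[index] = shift_time(events[index], offset, video_len)
    deleted = events[:index] + events[index + 1:]
    distorted = list(events)
    distorted[index] = replace(events[index], text=wrong_text)
    return {"shift_time": shifted, "delete_event": deleted, "distort_text": distorted}


preferred = [
    Event(10.0, 20.0, "puts a small bag in the rack"),
    Event(60.0, 70.0, "closes the door"),
]
rejected = make_rejected(
    preferred, index=0, offset=30.0, wrong_text="puts a large basket in the rack",
    video_len=120.0,
)
print(rejected["shift_time"][0])
```

Each variant changes a single aspect of the preferred response, so a pair (preferred, rejected) isolates either a grounding error or a textual error.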

Key Hyperparameters:

training_time: 1 day
gpu_config: 4x NVIDIA H100


Comparison to Prior Work

vs. E.T.Chat: Adds a decoupled grounding stage, visual-semantic <evi> tokens, and FPO training.
vs. TimeChat/LITA: Explicitly models grounding probability in preference optimization; captures event-level visual semantics in tokens.
vs. TRACE [not cited in paper]: TRACE focuses on causal event modeling; D2VLM focuses on factorized generation and preference optimization.

Limitations

Performance on episodic memory tasks remains relatively low (14.4% F1).
Data synthesis currently only generates negative (dis-preferred) samples; positive sample diversity is not explored.
Relies on off-the-shelf LLMs (Qwen) for data synthesis, inheriting their biases.
Requires two-stage generation which may increase inference latency compared to single-pass models (though not explicitly analyzed).

Reproducibility

Code: https://github.com/nusnlp/d2vlm

Model initialized from the E.T.Chat stage-2 checkpoint. The synthetic data generation pipeline is described in detail (using Qwen for text perturbations).

📊 Experiments & Results

Evaluation Setup

Evaluated on diverse video understanding tasks including grounding, captioning, and retrieval.

Benchmarks:

E.T. Bench Grounding (temporal grounding; 5 sub-tasks including Retrieval, Action Localization, and Summarization)
E.T. Bench Dense Captioning (Dense Video Captioning & Step Localization)
Charades-STA (Moment Retrieval)
YouCook2 (Dense Video Captioning)

Metrics:

F1 Score (for grounding)
Recall@1 (IoU=0.5, 0.7)
CIDEr
SODA_c
Statistical methodology: Not explicitly reported in the paper
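
The interval-based metrics listed above rest on temporal IoU. The sketch below shows temporal IoU, Recall@1 at an IoU threshold, and a simple F1 at a single threshold. The F1 matching rule used here (a prediction or ground-truth interval counts as matched if any counterpart reaches the threshold) is an assumption for illustration; the benchmark's official protocol may differ.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Interval, b: Interval) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Interval], gts: List[Interval], thr: float) -> float:
    # Fraction of queries whose top-1 prediction reaches the IoU threshold.
    hits = sum(temporal_iou(p, g) >= thr for p, g in zip(top1_preds, gts))
    return hits / len(gts)


def f1_at_iou(preds: List[Interval], gts: List[Interval], thr: float) -> float:
    # Assumed matching rule: an interval is matched if any counterpart reaches thr.
    if not preds or not gts:
        return 0.0
    matched_preds = sum(any(temporal_iou(p, g) >= thr for g in gts) for p in preds)
    matched_gts = sum(any(temporal_iou(p, g) >= thr for p in preds) for g in gts)
    precision = matched_preds / len(preds)
    recall = matched_gts / len(gts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
print(recall_at_1([(0.0, 8.0), (40.0, 50.0)], [(0.0, 10.0), (20.0, 30.0)], thr=0.5))
print(f1_at_iou([(0.0, 8.0), (40.0, 50.0)], [(0.0, 10.0)], thr=0.5))
```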

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on E.T. Bench showing significant improvements in grounding capability.
E.T. Bench Grounding	Avg F1	38.6	60.2	+21.6
E.T. Bench Grounding	Avg F1	46.6	60.2	+13.6
E.T. Bench Dense Captioning	Avg F1 (Grounding)	31.4	37.5	+6.1
Standard benchmarks for moment retrieval and captioning.
Charades-STA	R@1(IoU=0.5)	45.9	50.3	+4.4
YouCook2	CIDEr	8.1	10.6	+2.5
Ablation studies validating component contributions.
E.T. Bench Grounding	Avg F1	35.6	39.5	+3.9
E.T. Bench Grounding	Avg F1	39.5	42.3	+2.8
E.T. Bench Grounding	Avg F1	37.1	39.5	+2.4

Experiment Figures

Performance radar chart comparing D2VLM with other methods (Video-LLaMA, TimeChat, etc.) across 5 tasks.

The factorized preference data synthesis pipeline.

Main Takeaways

Decoupling grounding and answering (D2VLM) consistently outperforms coupled approaches, especially when enforcing consistency between stages.
Event-level visual semantic capture in <evi> tokens is critical; treating tokens just as timestamps is sub-optimal.
Factorized Preference Optimization (FPO) provides further gains by explicitly optimizing grounding probability alongside text generation.
Smaller models (3.8B) with specialized architecture can significantly outperform larger generalist models (7B/13B) on grounded video tasks.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture and Autoregressive generation
Video-Language Models (Video LLMs)
Reinforcement Learning from Human Feedback (RLHF) / Direct Preference Optimization (DPO)
Visual feature extraction (ViT, Q-Former)

Key Terms

D2VLM: The proposed framework (Decoupled & Dependent Video-Language Model), which decouples grounding from answering while making the answer depend on the grounded evidence.

FPO: Factorized Preference Optimization—an algorithm extending DPO to explicitly optimize both textual response and probabilistic temporal grounding.

evidence token (<evi>): A special token that not only marks a temporal event but explicitly aggregates visual features from salient video frames to serve as context.

interleaved text-evidence generation: Generating the final answer by mixing text tokens with evidence tokens that reference previously grounded events.

pure grounding: A preliminary generation stage where the model only outputs temporal evidence tokens before generating the full textual answer.

sub-video event: A semantically meaningful segment of a video (e.g., a specific action instance) used as the unit for perturbation in data synthesis.

DPO: Direct Preference Optimization—a stable method for aligning language models to preferences without a separate reward model.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique.