TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

📝 Paper Summary

Video Temporal Grounding (VTG), Video Large Language Models (Video-LLMs), Mixture-of-Experts (MoE)

TimeExpert decomposes video grounding tasks by dynamically routing specific token types (timestamps, saliency scores, text) to specialized experts within a Mixture-of-Experts architecture.

Core Problem

Existing Video-LLMs process distinct task tokens (temporal boundaries, saliency scores, captions) through a single shared pathway, failing to specialize for the fundamentally different nature of these subtasks.

Why it matters:

Humans effortlessly recognize actions but struggle with precise second-level temporal boundaries, a gap current models fail to bridge effectively
Standard parameter-sharing in Video-LLMs causes task interference, where learning to generate text might degrade the precision of timestamp prediction
Traditional grounding models cannot handle multiple subtasks concurrently, requiring separate models for retrieval, captioning, and highlighting

Concrete Example: In a cooking video, a model must simultaneously predict the exact start/end time of 'frying bacon', assign a saliency score to that segment, and generate a caption. A standard LLM treats the timestamp '00:45' and the word 'bacon' identically, leading to imprecise localization.

Key Novelty

TimeExpert (Task-Aware MoE for VTG)

Replaces the monolithic LLM decoder with a Mixture-of-Experts (MoE) architecture that specializes different experts for different output types (timestamps vs. text)
Introduces a dynamic routing mechanism that considers the 'type' of token being processed (e.g., score token vs. time token) to direct it to the most relevant expert
Uses a token-adaptive strategy that adds new experts if current ones are insufficient and prunes redundant ones to maintain efficiency

Architecture

The TimeExpert framework, illustrating the MoE decoder with Task-aware Dynamic Gating and Token-adaptive Routing.

Evaluation Highlights

+2.8 points mAP (IoU=0.5) and +4.2 points HIT@1 on QVHighlights over the state-of-the-art TRACE model
+2.5 points Recall@1 (IoU=0.5) on Charades-STA Moment Retrieval compared to TRACE
Achieves these results with fewer activated parameters (approx. 3.5B-4.8B) than 7B dense baselines such as TimeChat and TRACE

Breakthrough Assessment

8/10

Significant architectural shift for Video-LLMs by explicitly decoupling task tokens via MoE. Strong empirical gains across multiple VTG tasks with improved efficiency.

⚙️ Technical Details

Problem Definition

Setting: Video Temporal Grounding (VTG) formulated as Causal Event Modeling

Inputs: Textual instruction I and a sequence of video frames F

Outputs: Structured event representation R consisting of discrete events, each containing a timestamp, saliency score, and textual caption
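
A minimal sketch of the structured event representation R described above, written as a plain Python data structure. The field names (`start_s`, `end_s`, `saliency`, `caption`), the `validate` helper, and the example values are illustrative assumptions, not the paper's notation or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One grounded event: a time span, a saliency score, and a caption."""
    start_s: float   # start time in seconds
    end_s: float     # end time in seconds
    saliency: float  # relevance of the segment to the instruction (scale is hypothetical)
    caption: str     # free-text description of the segment


def validate(events: List[Event]) -> List[Event]:
    """Check that each span is well-formed and return events sorted by start time."""
    for e in events:
        if not 0.0 <= e.start_s <= e.end_s:
            raise ValueError(f"invalid span: ({e.start_s}, {e.end_s})")
    return sorted(events, key=lambda e: e.start_s)


# R for a hypothetical cooking video, given out of order on purpose.
R = validate([
    Event(62.0, 80.5, 3.1, "plate the bacon"),
    Event(45.0, 61.0, 4.5, "fry the bacon"),
])
```

Sorting by start time reflects the causal, event-by-event ordering that the problem setting implies; the model itself would emit these fields as tokens rather than as Python objects.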

Pipeline Flow

Visual Encoding & Compression (ViT + Slot-based compression)
Tokenization (Text + Special Time/Score tokens)
MoE Decoder (Dynamic Gating + Expert Processing)
Task Heads (Time, Score, Text generation)
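
To make the tokenization step concrete, the sketch below flattens one event into a sequence of typed tokens, which is the information a task-aware router would need. The type labels (`"time"`, `"score"`, `"text"`), the one-decimal formatting, and the whitespace split are all assumptions for illustration; the actual special-token vocabulary is not given in this summary.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token_type, surface form)


def serialize_event(start_s: float, end_s: float,
                    saliency: float, caption: str) -> List[Token]:
    """Flatten one event into typed tokens: two time tokens, one score token, then text."""
    tokens: List[Token] = [
        ("time", f"{start_s:.1f}"),
        ("time", f"{end_s:.1f}"),
        ("score", f"{saliency:.1f}"),
    ]
    # Hypothetical whitespace tokenization stands in for the real text tokenizer.
    tokens += [("text", word) for word in caption.split()]
    return tokens


seq = serialize_event(45.0, 61.0, 4.5, "fry the bacon")
token_types = [token_type for token_type, _ in seq]
```

Keeping the type next to each token means the decoder never has to guess whether `45.0` is a timestamp or ordinary text.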

System Modules

Visual Encoder (Input Processing)

Extract visual features from video frames

Model or implementation: Vision Transformer (ViT) initialized from CLIP/similar

Token Embeddings (Input Processing)

Convert text instructions and special task tokens into vector representations

Model or implementation: Embedding Layer

Task-Aware Dynamic Gating (MoE Processing)

Select appropriate experts based on token content and historical task relevance

Model or implementation: Learned Gating Network with task-weighted function
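
One way such a task-weighted gate could work is sketched below: content-based router logits are shifted by the log of each expert's historical activation rate for the current token type before the softmax. The additive log-bias form, the strength parameter `alpha`, the smoothing constant `eps`, and the example rates are assumptions; the paper's exact gating function is not reproduced in this summary.

```python
import math
from typing import Dict, List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def task_weighted_gate(logits: List[float], task: str,
                       activation_rate: Dict[str, List[float]],
                       alpha: float = 1.0, eps: float = 1e-6) -> List[float]:
    """Bias router logits by the log historical activation rate for this token type."""
    rates = activation_rate[task]
    biased = [z + alpha * math.log(r + eps) for z, r in zip(logits, rates)]
    return softmax(biased)


# Hypothetical history: expert 0 mostly served time tokens, expert 1 text tokens.
A = {"time": [0.8, 0.1, 0.1], "text": [0.1, 0.8, 0.1]}
logits = [0.2, 0.2, 0.2]  # content alone cannot separate the experts here
p_time = task_weighted_gate(logits, "time", A)
p_text = task_weighted_gate(logits, "text", A)
```

With identical content logits, the routing distribution follows the historical rates, so the time token prefers expert 0 and the text token prefers expert 1.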

Expert Layers (MoE Processing)

Process tokens using specialized sub-networks

Model or implementation: Sparse Mixture-of-Experts (Linear layers)
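
For readers new to sparse MoE layers, here is a generic top-k forward pass in plain Python. It is a textbook-style sketch, not TimeExpert's layer: the choice of `k`, the renormalized softmax over the selected experts, and the toy scaling "experts" are all assumptions.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def top_k_moe(x: Vector, gate_logits: Sequence[float],
              experts: Sequence[Callable[[Vector], Vector]], k: int = 2) -> Vector:
    """Run only the k highest-scoring experts and mix their outputs."""
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the mixing weights sum to 1.
    m = max(gate_logits[i] for i in top)
    weights = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(weights)
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # experts outside `top` are never evaluated
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out


def make_linear(scale: float) -> Callable[[Vector], Vector]:
    """Toy 'linear expert' that multiplies every feature by a fixed scale."""
    return lambda v: [scale * vi for vi in v]


experts = [make_linear(1.0), make_linear(2.0), make_linear(3.0), make_linear(4.0)]
y = top_k_moe([1.0, -1.0], gate_logits=[0.0, 5.0, 5.0, -2.0], experts=experts, k=2)
```

Here experts 1 and 2 tie for the top score and receive weight 0.5 each, so the output is 2.5 times the input. Only two of the four experts run, which is where the gap between total and activated parameters comes from.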

Novel Architectural Elements

Task-weighted gating function that explicitly incorporates historical task token activation rates (A_t) to bias routing
Dynamic expert management (Token-adaptive routing) that initializes new experts from average embeddings of under-served tokens and prunes redundant ones
Task-dependent auxiliary loss preventing expert collapse while encouraging specialization for specific token types (time vs. text)
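
The dynamic expert management idea above can be sketched as a single add/prune step over router keys. The criteria used here are assumptions chosen for illustration: a token counts as under-served when its best routing probability falls below `add_below`, and an expert is pruned when its usage fraction falls below `prune_below`. The paper's actual thresholds and bookkeeping are not given in this summary.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def adapt_experts(keys: List[Vector], token_embs: List[Vector],
                  max_gate_probs: List[float], usage: List[float],
                  add_below: float = 0.5, prune_below: float = 0.01) -> List[Vector]:
    """Return updated router keys after one prune-then-add step.

    keys:           one router key vector per existing expert
    token_embs:     embeddings of the tokens seen in the current window
    max_gate_probs: for each token, its highest routing probability
    usage:          fraction of tokens routed to each expert
    """
    # Prune experts that almost no token uses.
    kept = [key for key, u in zip(keys, usage) if u >= prune_below]
    # Tokens that no existing expert is confident about count as under-served.
    under_served = [e for e, p in zip(token_embs, max_gate_probs) if p < add_below]
    if under_served:
        # New expert key starts at the average embedding of under-served tokens.
        kept.append(mean_vector(under_served))
    return kept


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
usage = [0.6, 0.4, 0.0]                      # the third expert is never used
token_embs = [[1.0, 0.0], [2.0, 4.0], [4.0, 2.0]]
max_gate_probs = [0.9, 0.3, 0.4]             # the last two tokens are under-served
new_keys = adapt_experts(keys, token_embs, max_gate_probs, usage)
```

In this toy run the unused third expert is removed and a new expert is added with key `[3.0, 3.0]`, the mean of the two under-served token embeddings, so the expert count stays at three.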

Modeling

Base Model: ARIA (MoE Video-LLM base)

Training Method: Three-stage training: Task Module Pretraining -> MoE Decoder Pretraining -> Supervised Fine-tuning

Objective Functions:

Purpose: Ensure correct next-token prediction for text and task tokens.
Formally: Cross-Entropy Loss.

Purpose: Stabilize MoE training.
Formally: z-loss (from ST-MoE).

Purpose: Encourage experts to specialize in specific task tokens and prevent over-activation.
Formally: L_aux = λ1 * Task-Aware Concentration + λ2 * Activation Regularization.
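
The sketch below shows how these three objectives might be combined numerically. The router z-loss follows the ST-MoE definition (the mean over tokens of the squared log-sum-exp of the router logits). The two L_aux terms are passed in as precomputed numbers because their exact forms are not given here, and summing the three objectives into one scalar, the coefficient `z_coef`, and all example values are assumptions.

```python
import math
from typing import Sequence


def router_z_loss(router_logits: Sequence[Sequence[float]]) -> float:
    """ST-MoE router z-loss: mean over tokens of (log sum_j exp(logit_j))^2."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(z - m) for z in logits))  # stable log-sum-exp
        total += lse ** 2
    return total / len(router_logits)


def total_loss(ce: float, z: float, concentration: float, activation_reg: float,
               lambda_1: float, lambda_2: float, z_coef: float = 1e-3) -> float:
    """Cross-entropy + weighted z-loss + L_aux, with L_aux as in the formula above."""
    l_aux = lambda_1 * concentration + lambda_2 * activation_reg
    return ce + z_coef * z + l_aux


# Placeholder numbers only: lambda_1 and lambda_2 are not reported in the paper.
z = router_z_loss([[0.0, 0.0], [2.0, -1.0, 0.5]])
loss = total_loss(ce=1.7, z=z, concentration=0.4, activation_reg=0.2,
                  lambda_1=0.5, lambda_2=0.1)
```

The z-loss penalizes large router logits, which keeps the gating softmax in a numerically well-behaved range.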

Training Data:

Stage 1: 1.9M general multimodal video-text samples
Stage 2: 0.9M samples for MoE pretraining alignment
Stage 3: 2.3M samples for full fine-tuning (YouCook2, Charades-STA, QVHighlights, etc.)

Key Hyperparameters:

visual_tokens_per_frame: 8
visual_encoder_params: 438M
lambda_1: Not explicitly reported in the paper
lambda_2: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. TRACE: Decomposes processing via MoE experts instead of a shared dense backbone
vs. TimeChat: Explicitly models structured events (time, score, text) rather than treating everything as natural language generation
vs. generic MoE-LLMs: Adapts MoE routing specifically for multimodal task tokens (e.g., timestamps) rather than generic text tokens [comparison not cited in paper]

Limitations

Relies on pre-extracted visual features; end-to-end visual encoder tuning is not fully explored
Dynamic expert addition/removal introduces complexity in training stability
Performance gains on captioning metrics (CIDEr) are smaller compared to detection metrics (mAP/Recall)

Reproducibility

No code release is mentioned in the text. Training data sources (YouCook2, Charades-STA, etc.) are public benchmarks. Hyperparameters for the auxiliary loss weights are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Multi-task evaluation on Video Temporal Grounding benchmarks

Benchmarks:

YouCook2 (Dense Video Captioning)
Charades-STA (Moment Retrieval)
QVHighlights (Video Highlight Detection)

Metrics:

SODA_c
CIDEr
F1 Score
Recall@1 (IoU=0.5, 0.7)
mAP (IoU=0.5, 0.75)
HIT@1

Statistical methodology: Not explicitly reported in the paper
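
The Recall@1 (IoU=0.5, 0.7) metric listed above rests on temporal IoU between a predicted span and the ground-truth span. The sketch below uses the standard definitions; the example spans are made up and the function names are illustrative.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_s, end_s)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Span], gts: List[Span], threshold: float) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Two hypothetical queries: one well-localized, one poorly localized.
preds = [(45.0, 61.0), (10.0, 20.0)]
gts = [(44.0, 60.0), (16.0, 30.0)]
r_at_05 = recall_at_1(preds, gts, threshold=0.5)
r_at_07 = recall_at_1(preds, gts, threshold=0.7)
```

The first prediction overlaps its target with IoU 15/17 and the second with IoU 0.2, so only one of the two queries counts as a hit at either threshold.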

Key Results

Comparison against state-of-the-art VTG models shows consistent improvements across all three tasks.

Benchmark	Metric	Baseline	This Paper	Δ
YouCook2	SODA_c	2.2	2.5	+0.3
YouCook2	F1 Score	22.4	23.6	+1.2
Charades-STA	R@1 (IoU=0.5)	40.3	42.8	+2.5
QVHighlights	mAP	26.8	29.6	+2.8
QVHighlights	HIT@1	42.7	46.9	+4.2

Ablation studies demonstrate the importance of token-adaptive routing and task-dependent loss. In these rows the Baseline column is an ablated variant and This Paper is the full model.

Benchmark	Metric	Baseline	This Paper	Δ
Charades-STA	R@1 (IoU=0.5)	40.5	42.8	+2.3
Charades-STA	R@1 (IoU=0.5)	41.3	42.8	+1.5

Experiment Figures

Visualization of expert assignments for different task tokens in a vanilla MoE model.

Main Takeaways

Explicitly modeling task tokens (timestamps, scores) with specialized experts consistently outperforms shared-parameter approaches on the reported benchmarks.
Dynamic gating based on token type prevents task interference, allowing improved precision in both temporal localization and caption generation concurrently.
The model achieves these gains with fewer activated parameters (sparse activation) compared to dense 7B baselines, improving efficiency.
Increasing the number of input frames generally improves performance, validating the model's capacity to handle denser temporal information.

📚 Prerequisite Knowledge

Prerequisites

Mixture-of-Experts (MoE) architecture
Video-LLM fundamentals (visual encoders + LLM decoders)
Causal language modeling

Key Terms

VTG: Video Temporal Grounding—locating precise video segments (start/end times) relevant to a text query

MoE: Mixture-of-Experts—a neural network architecture where different sub-networks (experts) are activated for different inputs

saliency score: A scalar value indicating how relevant or important a specific video segment is to the query

routing: The process of deciding which expert network processes a given input token

gating network: A small neural network that calculates probabilities to select which experts to use

auxiliary loss: An additional training objective used to guide the model towards desired behaviors (like load balancing) without being the primary goal

IoU: Intersection over Union—a metric measuring the overlap between the predicted time segment and the ground truth segment

mAP: mean Average Precision—a metric summarizing precision-recall curves, commonly used in detection tasks

CIDEr: Consensus-based Image Description Evaluation, a metric for evaluating image/video captioning quality based on consensus with human references

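SODA_c: A dense video captioning metric from the SODA (Story Oriented Dense video cAptioning evaluation) framework, which scores the full sequence of generated captions as a coherent story rather than each caption in isolation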

Video-LLM: Large Language Models adapted to process video inputs, typically by projecting visual features into the LLM's token space