TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

📝 Paper Summary

Video Temporal Grounding (VTG) · Multimodal Large Language Models (MLLMs)

TimeLens establishes a reliable baseline for video temporal grounding by rigorously cleaning defective benchmarks and applying thinking-free reinforcement learning on high-quality re-annotated data.

Core Problem

Existing video temporal grounding benchmarks are rife with low-quality queries (ambiguous, non-unique) and inaccurate timestamps, causing misleading evaluations where models learn shortcuts rather than temporal perception.

Why it matters:

Current benchmarks misguide research: open-source models often 'win' on legacy leaderboards by exploiting dataset biases, while robust proprietary models rank poorly, a trend that reverses on clean data
Without accurate temporal grounding, MLLMs cannot reliably answer 'when' events happen, limiting their utility in fine-grained video understanding and reasoning systems

Concrete Example: In Charades-STA, multiple nearly identical queries often describe the same event, or queries like 'ending credits' leak temporal position. A model trained on this might output a correct timestamp based on text shortcuts without watching the video, failing when tested on unique, perception-dependent queries.

Key Novelty

TimeLens (Data Curation + RLVR Algorithm)

Constructs 'TimeLens-Bench' by manually auditing and re-annotating popular benchmarks (Charades-STA, ActivityNet, QVHighlights) to enforce strict uniqueness and precision criteria
Develops 'TimeLens-100K', a large-scale training set created via automated re-annotation of noisy corpora using advanced MLLMs
Proposes a 'thinking-free' RLVR training recipe that optimizes directly for IoU (Intersection over Union) using interleaved textual timestamp encoding, avoiding complex architectural add-ons

Architecture

Conceptual framework for building TimeLens MLLMs, highlighting the dual focus on Data Quality and Algorithmic Design.

Evaluation Highlights

Discovers 20.6% of samples in Charades-STA violate query uniqueness and 34.9% have annotation accuracy issues
Reverses model rankings: Proprietary models (Gemini-1.5-Pro) significantly outperform open-source baselines on the refined TimeLens-Bench, whereas they lagged behind on legacy benchmarks
TimeLens-8B (based on Qwen3-VL) achieves state-of-the-art performance on TimeLens-Bench, surpassing proprietary models like GPT-5 and Gemini-2.5-Flash

Breakthrough Assessment

8/10

Critically exposes severe flaws in standard benchmarks that have likely skewed the field. The proposed data pipeline and strong baseline (TimeLens) reset the standard for future VTG research.

⚙️ Technical Details

Problem Definition

Setting: Video Temporal Grounding (VTG). Given a video and a text query, localize the time segment that corresponds to the query.

Inputs: Video v, Text Query q

Outputs: Temporal segment S = (t_start, t_end)

Pipeline Flow

Video Input -> Visual Encoder
Text Query -> Text Encoder
Multimodal Fusion (LLM Backbone)
Interleaved Textual Timestamp Generation

System Modules

Visual Encoder

Extract visual features from video frames

Model or implementation: Not explicitly specified (likely the vision encoder built into the Qwen-VL backbone)

LLM Backbone

Process visual and text tokens to determine temporal boundaries

Model or implementation: Qwen2.5-VL-7B or Qwen3-VL-8B

Timestamp Generator

Output start and end timestamps as text strings

Model or implementation: Shared LLM Head
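
Because timestamps are emitted as plain text, the generated string has to be parsed back into a segment before it can be scored. The following is a minimal sketch of such a parser. The answer format (two numbers in seconds appearing in the output) and the name `parse_segment` are assumptions for illustration, not the paper's actual output template.

```python
import re
from typing import Optional, Tuple

# Assumption: the model answers with two numbers (in seconds) somewhere in its text.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_segment(text: str) -> Optional[Tuple[float, float]]:
    """Return (t_start, t_end) in seconds, or None if the text is unusable."""
    numbers = [float(n) for n in _NUMBER.findall(text)]
    if len(numbers) < 2:
        return None  # fewer than two timestamps found
    start, end = numbers[0], numbers[1]
    if end <= start:
        return None  # reject empty or reversed segments
    return start, end


print(parse_segment("The event happens from 12.5 to 30 seconds."))
```

Returning `None` for malformed answers keeps the failure case explicit, so a downstream scorer can decide how to treat it.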

Novel Architectural Elements

Uses a simple interleaved textual encoding for timestamps rather than complex dedicated heads or special embeddings, finding it superior for VTG
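
To make the idea of interleaved textual timestamps concrete, the sketch below builds an input sequence in which each frame is preceded by its time written as ordinary text. The `<frame>` placeholder, the one-decimal formatting, and the function name are hypothetical; the paper's exact template may differ.

```python
from typing import List

# Hypothetical stand-in for the visual tokens of one sampled frame.
FRAME_PLACEHOLDER = "<frame>"


def interleave_timestamps(num_frames: int, fps: float) -> str:
    """Prefix every frame placeholder with its timestamp as plain text."""
    parts: List[str] = []
    for i in range(num_frames):
        t = i / fps  # time of the i-th sampled frame, in seconds
        parts.append(f"<{t:.1f}>")
        parts.append(FRAME_PLACEHOLDER)
    return "".join(parts)


print(interleave_timestamps(3, 2.0))
# <0.0><frame><0.5><frame><1.0><frame>
```

The timestamps here go through the normal text tokenizer, so no new embedding table or prediction head is needed.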

Modeling

Base Model: Qwen2.5-VL-7B and Qwen3-VL-8B

Training Method: Reinforcement Learning with Verifiable Rewards (RLVR)

Objective Functions:

Purpose: Optimize model policy to maximize overlap with ground truth segments.

Formally: RL objective maximizing reward r(S_pred, S_gt) based on IoU.
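
A verifiable reward of this kind can be computed directly from the two segments. The sketch below assumes the reward is exactly the temporal IoU and that unparseable or degenerate predictions receive zero reward; both choices are illustrative assumptions, and the paper's reward may include further terms.

```python
from typing import Optional, Tuple

Segment = Tuple[float, float]  # (t_start, t_end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred: Optional[Segment], gt: Segment) -> float:
    """Reward r(S_pred, S_gt); zero for missing or invalid predictions (assumption)."""
    if pred is None or pred[1] <= pred[0]:
        return 0.0
    return temporal_iou(pred, gt)


print(iou_reward((0.0, 10.0), (5.0, 15.0)))
```

For the prediction (0, 10) against the ground truth (5, 15), the intersection is 5 seconds and the union is 15 seconds, giving a reward of 1/3.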

Training Data:

TimeLens-100K: Automated re-annotation of 100K samples from DiDeMo, COIN, etc., using advanced MLLMs to filter bad queries and refine timestamps
TimeLens-Bench (evaluation suite, listed here for contrast): Manual re-annotation of Charades-STA, ActivityNet Captions, QVHighlights

Compute: Not reported in the paper

Comparison to Prior Work

vs. Time-R1: TimeLens uses a 'thinking-free' approach (no Chain of Thought) and trains on cleaner data (TimeLens-100K), achieving higher performance
vs. Gemini-1.5-Pro: TimeLens is open-source and fine-tuned specifically for VTG, surpassing Gemini on the rigorous TimeLens-Bench
vs. General MLLMs (Qwen-VL): TimeLens introduces specific RLVR recipes (early stopping, difficulty sampling) tailored for temporal grounding

Limitations

Relies on automated re-annotation for training data, which may still contain residual noise, unlike the manually re-annotated evaluation data
Does not introduce a novel model architecture, focusing instead on data and training recipes
Performance comparison relies heavily on the newly introduced TimeLens-Bench

📊 Experiments & Results

Evaluation Setup

Video Temporal Grounding on refined benchmarks

Benchmarks:

TimeLens-Bench (Video Temporal Grounding) [New]
Charades-STA (Video Temporal Grounding)
ActivityNet Captions (Video Temporal Grounding)
QVHighlights (Video Temporal Grounding)

Metrics:

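R1@m (m = 0.3, 0.5, 0.7): percentage of queries whose top-1 predicted segment has an IoU of at least m with the ground truth
mIoU: mean IoU over all queries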
Statistical methodology: Not explicitly reported in the paper
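
The metrics above can be computed as in the following sketch, which uses the standard definitions of R1@m and mIoU. The function names and the choice to report percentages are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Dict, Sequence, Tuple

Segment = Tuple[float, float]  # (t_start, t_end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(
    preds: Sequence[Segment],
    gts: Sequence[Segment],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Return R1@m for each threshold and mIoU, all as percentages."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and equally long")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    results = {
        f"R1@{m}": 100.0 * sum(iou >= m for iou in ious) / len(ious)
        for m in thresholds
    }
    results["mIoU"] = 100.0 * sum(ious) / len(ious)
    return results


scores = evaluate(
    preds=[(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)],
    gts=[(0.0, 10.0), (5.0, 15.0), (10.0, 15.0)],
)
print(scores)
```

In this toy example the per-query IoUs are 1, 1/3, and 0, so two of the three queries pass the 0.3 threshold and only one passes 0.5 and 0.7.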

Key Results

Data quality analysis reveals severe issues in legacy benchmarks, quantifying the extent of unreliable data.

Benchmark	Metric	Baseline	This Paper	Δ
Charades-STA	Query Uniqueness Violation Rate (%)	0	20.6	+20.6
Charades-STA	Annotation Accuracy Issue Rate (%)	0	34.9	+34.9

Experiment Figures

Ranking change of models between Legacy Benchmarks and TimeLens-Bench.

Distribution of error types across Charades-STA, ActivityNet, and QVHighlights.

Main Takeaways

Legacy benchmarks (Charades-STA, etc.) are unreliable: a significant portion of queries are non-unique or inaccurate, rewarding models that exploit shortcuts rather than visual perception.
Evaluation Reversal: On legacy benchmarks, open-source models often beat proprietary ones; on the rigorous TimeLens-Bench, this trend reverses, with proprietary models (like Gemini) performing much better, validating the new benchmark's quality.
TimeLens-8B (Qwen3-VL based) sets a new state-of-the-art on TimeLens-Bench, surpassing even proprietary giants like GPT-5 and Gemini-2.5-Flash, demonstrating the efficacy of the RLVR + High-Quality Data recipe.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal LLMs (MLLMs) and their tokenization
Familiarity with Reinforcement Learning (RL) concepts like rewards and policy optimization
Knowledge of VTG metrics like IoU (Intersection over Union)

Key Terms

VTG: Video Temporal Grounding—the task of finding exact start and end times for a text query within a video

RLVR: Reinforcement Learning with Verifiable Rewards—a training method where the model is optimized using a ground-truth verifier (like IoU) rather than a learned reward model

IoU: Intersection over Union—a metric measuring the overlap between the predicted time segment and the ground truth segment

SFT: Supervised Fine-Tuning—standard training on labeled data before applying RL

thinking-free: A model approach that outputs answers directly without generating intermediate 'reasoning' or 'thought' tokens

interleaved textual encoding: Representing time by inserting text tokens (e.g., '<0.5>') directly into the sequence, rather than using special learned embeddings

TimeLens-Bench: The authors' newly curated, high-quality evaluation suite derived from re-annotating existing datasets

TimeLens-100K: The authors' newly created training dataset, generated by automatically fixing labels in existing corpora