Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Microsoft Research Asia,
University of Science and Technology of China
arXiv.org
(2025)
MM, Agent, Reasoning, QA, Benchmark
📝 Paper Summary
Long-form video understanding, Multimodal agents
Deep Video Discovery (DVD) treats long-form video understanding as a multi-step search problem, using an LLM agent to iteratively query a multi-granular video database via specialized tools.
Core Problem
Current Large Language Models struggle with hour-long videos due to context length limits and information density, leading to poor reasoning and instruction following.
Why it matters:
Existing video agents rely on rigid, manually designed workflows (e.g., tree searches) that cannot adapt to diverse query types
Token compression techniques for long context handling introduce information loss and uncertainty
Retrieving fine-grained details from hour-long content requires navigating both global context and pixel-level specifics, which fixed-structure methods fail to balance efficiently
Concrete Example: Existing tree-based search methods navigate from the root to leaf nodes, which is inefficient for fine-grained queries that need direct leaf access. Furthermore, relevant entities may not be temporally close, which makes the backtracking mechanisms of tree searches inefficient.
Key Novelty
Autonomous Agentic Search over Multi-granular Video Database
Conceptually reframes video understanding as an iterative 'search and discovery' process rather than a single-pass processing task
Constructs a hierarchical database containing global summaries, clip-level captions, and raw frames
Empowers an LLM agent with modular tools (Browse, Search, Inspect) to autonomously plan and execute search strategies based on the query's specific needs
Architecture
Overview of the Deep Video Discovery framework, showing the offline database construction (Left) and the online Agentic Search and Answer process (Right).
Evaluation Highlights
Achieves 74.2% accuracy on the LVBench dataset, setting a new state-of-the-art
Further improves LVBench accuracy to 76.0% when auxiliary transcripts are used
Surpasses all prior work on LVBench by a large margin
Breakthrough Assessment
8/10
Significant performance jump on a challenging benchmark (LVBench) by shifting from fixed processing pipelines to a fully agentic, search-based paradigm.
⚙️ Technical Details
Problem Definition
Setting: Long-form video question answering via agentic information retrieval
Inputs: Ultra-long video V and user query Q
Outputs: Final answer to the user query
Pipeline Flow
Data Prep: Segment video -> Generate Captions/Summaries -> Build Database
Tools: Global Browse / Clip Search / Frame Inspect
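The data-prep stage above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the names `Clip`, `VideoDatabase`, `segment`, and `build_database` are hypothetical, the captioning and summarization models are passed in as plain callables, and only the 5-second clip length comes from this summary.

```python
from dataclasses import dataclass, field

CLIP_SECONDS = 5.0  # clip length stated in this summary


@dataclass
class Clip:
    index: int
    start: float  # seconds
    end: float    # seconds
    caption: str = ""  # filled in by a captioning model (not modeled here)


@dataclass
class VideoDatabase:
    """Three tiers: global summary, clip captions, pointer to raw frames."""
    global_summary: str = ""
    clips: list[Clip] = field(default_factory=list)
    frame_dir: str = ""  # hypothetical location of decoded raw frames


def segment(duration: float, clip_seconds: float = CLIP_SECONDS) -> list[Clip]:
    """Split [0, duration) into consecutive clips; the last one may be shorter."""
    clips = []
    start = 0.0
    while start < duration:
        end = min(start + clip_seconds, duration)
        clips.append(Clip(index=len(clips), start=start, end=end))
        start = end
    return clips


def build_database(duration, caption_fn, summarize_fn, frame_dir=""):
    """Segment -> caption each clip -> summarize captions -> assemble database."""
    clips = segment(duration)
    for clip in clips:
        clip.caption = caption_fn(clip)
    summary = summarize_fn([c.caption for c in clips])
    return VideoDatabase(global_summary=summary, clips=clips, frame_dir=frame_dir)
```

In a real system `caption_fn` and `summarize_fn` would wrap VLM calls; here any callable with the same shape works, which keeps the sketch testable without a model.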
System Modules
Multi-granular Database Builder
Pre-processes video into retrievable tiers: global summaries, clip captions, and raw frames
Model or implementation: Large VLM (for captioning)
Agent Orchestrator
Iteratively reasons, plans, and selects tools to gather information
Model or implementation: LLM (Reasoning Model)
Global Browse Tool (Search Tools)
Provides high-level context via pre-computed subject summaries or query-specific event summaries
Model or implementation: VLM (for event summarization)
Clip Search Tool (Search Tools)
Retrieves relevant video segments using semantic similarity matching
Model or implementation: Embedding Model
Frame Inspect Tool (Search Tools)
Examines raw pixels for fine-grained visual details not present in captions
Model or implementation: VLM
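One way to picture how the orchestrator and the three tools fit together is the loop below. It is a sketch under stated assumptions: the action format, the `run_agent` function, the `max_steps` budget, and the stub tools are all hypothetical, and `scripted_policy` is a fixed stand-in for the reasoning LLM, which in the described system chooses tools adaptively rather than following a script.

```python
from typing import Any

Action = dict[str, Any]  # hypothetical format: {"tool": name, "args": {...}}


def run_agent(query, policy, tools, max_steps=8):
    """Reason-act loop: ask the policy for an action, run the tool, record the result."""
    history = []
    for _ in range(max_steps):
        action = policy(query, history)
        if action["tool"] == "answer":
            return action["args"]["text"], history
        observation = tools[action["tool"]](**action["args"])
        history.append((action, observation))
    return None, history  # step budget exhausted without an answer


# Stub tools standing in for the VLM- and embedding-backed tools.
TOOLS = {
    "global_browse": lambda: "summary: a cooking show with two hosts",
    "clip_search": lambda text: [(120.0, 125.0, "host picks up a bowl")],
    "frame_inspect": lambda start, end, question: "the bowl is red",
}


def scripted_policy(query, history):
    """Stand-in for the reasoning LLM: a fixed coarse-to-fine plan."""
    plan = [
        {"tool": "global_browse", "args": {}},
        {"tool": "clip_search", "args": {"text": "bowl"}},
        {"tool": "frame_inspect",
         "args": {"start": 120.0, "end": 125.0, "question": query}},
        {"tool": "answer", "args": {"text": "red"}},
    ]
    return plan[len(history)]


answer, history = run_agent("What color is the bowl?", scripted_policy, TOOLS)
```

Keeping the tools behind a name-to-callable mapping means the loop itself does not change when a tool is added or swapped; only the policy needs to know about it.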
Novel Architectural Elements
Multi-granular video database structure combining global registry, vector-searchable captions, and raw frame access
Agentic search loop replacing fixed-tree or linear-scan workflows for video QA
Progressive structured subject registry during captioning to maintain identity consistency
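To make "vector-searchable captions" concrete, the sketch below ranks clip captions by cosine similarity to a query. The bag-of-words `embed` function is a deliberately crude, assumed stand-in for the embedding model, and `clip_search` with its `top_k` parameter is a hypothetical name, not the tool's actual interface.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clip_search(query, captions, top_k=3):
    """Return (clip_index, score) pairs for the top_k most similar captions."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), i) for i, c in enumerate(captions)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [(i, score) for score, i in scored[:top_k]]


captions = [
    "a man opens a red door",
    "two dogs run on the beach",
    "a woman reads a book",
]
hits = clip_search("red door", captions, top_k=1)
```

Because every clip is scored directly, a relevant clip can be reached without walking a hierarchy from coarse to fine, which is the "direct leaf access" property contrasted with tree-based search.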
Modeling
Base Model: A reasoning LLM. The specific model name is not given in the extracted text.
Compute: Not reported in the paper
Comparison to Prior Work
vs. VideoTree/VCA: DVD uses autonomous, adaptive tool selection rather than a fixed hierarchical tree workflow, allowing direct leaf access and flexible planning
vs. AdaRETAKE: DVD uses agentic retrieval over a database rather than token compression, avoiding information loss from compression artifacts
Limitations
Depends on the capabilities of the underlying LLM and VLM; failures in base models propagate to the agent
Computational cost of constructing the multi-granular database (captioning every 5s clip) may be high
Frame Inspect tool is limited to 50 frames, potentially missing details in longer requested intervals
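The effect of the 50-frame cap can be illustrated with a small sampling sketch. Only the cap itself comes from this summary; uniform sampling, the `fps` parameter, and the function name `sample_timestamps` are assumptions made for illustration, since the summary does not say how frames are chosen within an interval.

```python
MAX_FRAMES = 50  # cap stated in this summary


def sample_timestamps(start, end, fps=1.0, max_frames=MAX_FRAMES):
    """Pick evenly spaced timestamps in [start, end], never more than max_frames.

    Assumed strategy: one frame per 1/fps seconds until the cap is reached,
    then spread the capped number of frames across the whole interval.
    """
    if end <= start:
        return []
    wanted = int((end - start) * fps)
    n = max(1, min(max_frames, wanted))
    step = (end - start) / n
    # Sample at the midpoint of each of the n equal sub-intervals.
    return [start + (i + 0.5) * step for i in range(n)]


short = sample_timestamps(0.0, 10.0)    # 10 frames, 1 second apart
long_ = sample_timestamps(0.0, 600.0)   # capped at 50 frames, 12 seconds apart
```

Under these assumptions a 10-minute request leaves 12-second gaps between inspected frames, so a brief event inside a long interval can fall between samples; requesting narrower intervals avoids this.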
Code is publicly released at https://github.com/microsoft/DeepVideoDiscovery. The paper describes the database construction and agent loop algorithms in detail.
📊 Experiments & Results
Evaluation Setup
Evaluation on long-form video understanding benchmarks
Benchmarks:
LVBench (Long-form video question answering)
Metrics:
Accuracy
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
LVBench | Accuracy (%) | Not reported in the paper | 74.2 | Not reported in the paper
LVBench (with auxiliary transcripts) | Accuracy (%) | 74.2 (without transcripts) | 76.0 | +1.8
Main Takeaways
The agentic approach allows for adaptive planning, handling diverse query types better than fixed workflows
Multi-granular tools (Global, Clip, Frame) are effectively orchestrated by the LLM to zoom in from high-level context to pixel details
State-of-the-art performance on LVBench demonstrates the efficacy of the search-based paradigm for hour-long videos
📚 Prerequisite Knowledge
Prerequisites
Large Language Models (LLMs) and Vision-Language Models (VLMs)
Agentic workflows (ReAct)
Vector embeddings and retrieval
Key Terms
ReAct: Reasoning and Acting, a paradigm in which agents interleave reasoning traces with actions
VLM: Vision-Language Model—a model capable of understanding and generating text based on visual inputs
LVBench: A benchmark dataset designed for evaluating long-form video understanding
VQA: Visual Question Answering—the task of answering natural language questions about visual content
Clip: A short, continuous segment of a video (here, 5 seconds)
Subject Registry: A structured record of entities (people, objects) tracked throughout the video to maintain consistency in captions