Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

📝 Paper Summary

Long-form video understanding, Multimodal agents

Deep Video Discovery (DVD) treats long-form video understanding as a multi-step search problem, using an LLM agent to iteratively query a multi-granular video database via specialized tools.

Core Problem

Current Large Language Models struggle with hour-long videos due to context length limits and information density, leading to poor reasoning and instruction following.

Why it matters:

Existing video agents rely on rigid, manually designed workflows (e.g., tree searches) that cannot adapt to diverse query types
Token compression techniques for long context handling introduce information loss and uncertainty
Retrieving fine-grained details from hour-long content requires navigating both global context and pixel-level specifics, which fixed-structure methods fail to balance efficiently

Concrete Example: Existing tree-based search methods navigate from the root to leaf nodes, which is inefficient for fine-grained queries that need direct leaf access. Furthermore, relevant entities may not be temporally close, which makes the backtracking mechanisms in tree searches inefficient.

Key Novelty

Autonomous Agentic Search over Multi-granular Video Database

Conceptually reframes video understanding as an iterative 'search and discovery' process rather than a single-pass processing task
Constructs a hierarchical database containing global summaries, clip-level captions, and raw frames
Empowers an LLM agent with modular tools (Browse, Search, Inspect) to autonomously plan and execute search strategies based on the query's specific needs

Architecture

Overview of the Deep Video Discovery framework, showing the offline database construction (Left) and the online Agentic Search and Answer process (Right).

Evaluation Highlights

Achieves 74.2% accuracy on the LVBench dataset, setting a new state-of-the-art
Further improves LVBench accuracy to 76.0% when auxiliary transcripts are used
Surpasses all prior work on LVBench by a large margin

Breakthrough Assessment

8/10

Significant performance jump on a challenging benchmark (LVBench) by shifting from fixed processing pipelines to a fully agentic, search-based paradigm.

⚙️ Technical Details

Problem Definition

Setting: Long-form video question answering via agentic information retrieval

Inputs: Ultra-long video V and user query Q

Outputs: Final answer to the user query

Pipeline Flow

Data Prep: Segment video -> Generate Captions/Summaries -> Build Database
Inference: Agent Loop (Observe -> Reason -> Select Tool -> Act -> Update History)
Tools: Global Browse / Clip Search / Frame Inspect
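
The inference loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the tuple-based action format, and the step budget are assumptions, and a scripted policy stands in for the reasoning LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: one string argument in, one observation string out.
Tool = Callable[[str], str]


@dataclass
class AgentState:
    query: str
    # Each entry is (tool name, argument, observation).
    history: List[Tuple[str, str, str]] = field(default_factory=list)


def run_agent(query: str,
              tools: Dict[str, Tool],
              policy: Callable[[AgentState], Tuple[str, str]],
              max_steps: int = 8) -> str:
    """Observe -> Reason -> Select Tool -> Act -> Update History."""
    state = AgentState(query)
    for _ in range(max_steps):
        # Reason and select a tool (an LLM call in a real system).
        action, argument = policy(state)
        if action == "answer":
            return argument
        observation = tools[action](argument)                   # Act
        state.history.append((action, argument, observation))   # Update history
    return "no answer within step budget"


# Stand-in tools that return canned observations (assumptions for illustration).
def global_browse(arg: str) -> str:
    return "summary: a cooking show with two hosts"


def clip_search(arg: str) -> str:
    return "clip 12 (60-65s): host picks up a bowl"


def frame_inspect(arg: str) -> str:
    return "the bowl is red"


TOOLS = {"global_browse": global_browse,
         "clip_search": clip_search,
         "frame_inspect": frame_inspect}


def scripted_policy(state: AgentState) -> Tuple[str, str]:
    """A fixed coarse-to-fine script standing in for the reasoning LLM."""
    plan = [("global_browse", state.query),
            ("clip_search", "bowl"),
            ("frame_inspect", "60-65")]
    step = len(state.history)
    if step < len(plan):
        return plan[step]
    return ("answer", state.history[-1][2])


answer = run_agent("What colour is the bowl?", TOOLS, scripted_policy)
print(answer)
```

In an actual agent the policy would be free to skip tiers or revisit them depending on the query, which is the flexibility the summary contrasts with fixed workflows.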

System Modules

Multi-granular Database Builder

Pre-processes video into retrievable tiers: global summaries, clip captions, and raw frames

Model or implementation: Large VLM (for captioning)
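
As a rough sketch of the three tiers, the snippet below segments a video into 5-second clips (the clip length given in this summary), attaches a caption to each, and derives one global summary. The class names, the `caption_clip` and `summarize` callables, and the choice to address raw frames by timestamp rather than store them are assumptions; in practice the callables would wrap a VLM.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List

CLIP_SECONDS = 5.0  # clip length stated in this summary


@dataclass
class Clip:
    index: int
    start: float  # seconds; raw frames are assumed to be fetched by timestamp
    end: float
    caption: str = ""


@dataclass
class VideoDatabase:
    global_summary: str = ""
    clips: List[Clip] = field(default_factory=list)


def segment(duration: float, clip_seconds: float = CLIP_SECONDS) -> List[Clip]:
    """Split [0, duration) into consecutive clips; the last one may be shorter."""
    count = math.ceil(duration / clip_seconds)
    return [Clip(i, i * clip_seconds, min((i + 1) * clip_seconds, duration))
            for i in range(count)]


def build_database(duration: float,
                   caption_clip: Callable[[float, float], str],
                   summarize: Callable[[List[str]], str]) -> VideoDatabase:
    """Caption every clip, then summarize all captions into the global tier."""
    clips = segment(duration)
    for clip in clips:
        clip.caption = caption_clip(clip.start, clip.end)
    summary = summarize([c.caption for c in clips])
    return VideoDatabase(global_summary=summary, clips=clips)


# Stand-in captioner and summarizer (hypothetical; no model is called here).
db = build_database(
    12.0,
    caption_clip=lambda s, e: f"caption for {s:.0f}-{e:.0f}s",
    summarize=lambda caps: f"{len(caps)} clips captioned",
)
print(db.global_summary, len(segment(3600.0)))
```

Under these assumptions a one-hour video yields 720 clips, which gives a sense of how many captioning calls the offline stage needs.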

Agent Orchestrator

Iteratively reasons, plans, and selects tools to gather information

Model or implementation: LLM (Reasoning Model)

Global Browse Tool (Search Tools)

Provides high-level context via pre-computed subject summaries or query-specific event summaries

Model or implementation: VLM (for event summarization)

Clip Search Tool (Search Tools)

Retrieves relevant video segments using semantic similarity matching

Model or implementation: Embedding Model
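
Semantic similarity matching of this kind is commonly done by ranking clip captions by cosine similarity to the query. The sketch below illustrates that idea only: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, and the function names and `top_k` parameter are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption; a real system uses a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def clip_search(query: str, captions: List[str], top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (clip index, score) pairs for the top_k most similar captions."""
    q = embed(query)
    scored = [(i, cosine(q, embed(c))) for i, c in enumerate(captions)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


captions = [
    "a man opens the fridge",
    "a woman chops onions on a board",
    "the man pours milk into a red bowl",
]
hits = clip_search("red bowl", captions, top_k=1)
print(hits)
```

Because retrieval indexes every clip directly, the agent can jump to a relevant segment without walking a hierarchy first.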

Frame Inspect Tool (Search Tools)

Examines raw pixels for fine-grained visual details not present in captions

Model or implementation: VLM
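
The Limitations section notes that Frame Inspect is capped at 50 frames. One plausible way to honor such a cap is uniform subsampling of the requested interval, sketched below. The function name, the `fps` parameter, and the uniform spacing are assumptions, not details confirmed by the paper.

```python
from typing import List

MAX_FRAMES = 50  # cap mentioned in this summary's Limitations


def sample_timestamps(start: float, end: float,
                      fps: float = 1.0,
                      max_frames: int = MAX_FRAMES) -> List[float]:
    """Pick frame timestamps in [start, end).

    Uses the native rate when it fits under the cap; otherwise spreads
    max_frames timestamps uniformly over the interval.
    """
    if end <= start:
        return []
    native = int((end - start) * fps)
    n = max(1, min(native, max_frames))
    step = (end - start) / n
    return [start + i * step for i in range(n)]


short = sample_timestamps(0.0, 10.0)    # 10 frames, one per second
long_ = sample_timestamps(0.0, 600.0)   # capped: one frame every 12 seconds
print(len(short), len(long_), long_[1] - long_[0])
```

With these assumptions a 10-minute request is sampled only once every 12 seconds, which shows how brief events can fall between frames when the interval is long.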

Novel Architectural Elements

Multi-granular video database structure combining global registry, vector-searchable captions, and raw frame access
Agentic search loop replacing fixed-tree or linear-scan workflows for video QA
Progressive structured subject registry during captioning to maintain identity consistency
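
To make the subject registry idea concrete, here is a small sketch of a registry that is updated clip by clip and can be rendered as context for the next captioning call. The class and method names are hypothetical, and matching subjects by normalized name is a simplifying assumption; in the described system the captioning model would be responsible for deciding identity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subject:
    name: str
    description: str
    clip_indices: List[int] = field(default_factory=list)


class SubjectRegistry:
    """Accumulates subjects across clips so later captions reuse earlier identities."""

    def __init__(self) -> None:
        self.subjects: Dict[str, Subject] = {}

    def update(self, clip_index: int, observed: Dict[str, str]) -> None:
        """Merge subjects seen in one clip; the first description is kept."""
        for name, description in observed.items():
            key = name.strip().lower()  # naive identity matching (assumption)
            if key not in self.subjects:
                self.subjects[key] = Subject(name, description)
            self.subjects[key].clip_indices.append(clip_index)

    def as_prompt_context(self) -> str:
        """Render the registry as text to prepend to the next captioning prompt."""
        return "\n".join(f"{s.name}: {s.description}" for s in self.subjects.values())


reg = SubjectRegistry()
reg.update(0, {"Host A": "man in a blue apron"})
reg.update(3, {"host a": "man in apron", "Guest": "woman with glasses"})
print(reg.as_prompt_context())
```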

Modeling

Base Model: A reasoning LLM; the specific model is not named in the text extracts

Compute: Not reported in the paper

Comparison to Prior Work

vs. VideoTree/VCA: DVD uses autonomous, adaptive tool selection rather than a fixed hierarchical tree workflow, allowing direct leaf access and flexible planning
vs. AdaRETAKE: DVD uses agentic retrieval over a database rather than token compression, avoiding information loss from compression artifacts

Limitations

Depends on the capabilities of the underlying LLM and VLM; failures in base models propagate to the agent
Computational cost of constructing the multi-granular database (captioning every 5s clip) may be high
Frame Inspect tool is limited to 50 frames, potentially missing details in longer requested intervals

Reproducibility

Code: https://github.com/microsoft/DeepVideoDiscovery

The paper describes the database construction and agent loop algorithms in detail.

📊 Experiments & Results

Evaluation Setup

Evaluation on long-form video understanding benchmarks

Benchmarks:

LVBench (Long-form video question answering)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LVBench	Accuracy (%)	Not reported in the paper	74.2	Not reported in the paper
LVBench (with auxiliary transcripts)	Accuracy (%)	74.2 (DVD without transcripts)	76.0	+1.8

Main Takeaways

The agentic approach allows for adaptive planning, handling diverse query types better than fixed workflows
Multi-granular tools (Global, Clip, Frame) are effectively orchestrated by the LLM to zoom in from high-level context to pixel details
State-of-the-art performance on LVBench demonstrates the efficacy of the search-based paradigm for hour-long videos

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and Vision-Language Models (VLMs)
Agentic workflows (ReAct)
Vector embeddings and retrieval

Key Terms

ReAct: Reason-Act—a paradigm where agents generate reasoning traces before taking actions

VLM: Vision-Language Model—a model capable of understanding and generating text based on visual inputs

LVBench: A benchmark dataset designed for evaluating long-form video understanding

VQA: Visual Question Answering—the task of answering natural language questions about visual content

Clip: A short, continuous segment of a video (here, 5 seconds)

Subject Registry: A structured record of entities (people, objects) tracked throughout the video to maintain consistency in captions