VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

📝 Paper Summary

Memory organization; Multi-call tool use with flexible plan

VideoAgent solves long-form video understanding by constructing a structured memory of event captions and tracked objects, which an LLM interactively queries using specialized tools.

Core Problem

End-to-end multimodal LLMs struggle with long-form videos due to prohibitive memory costs and attention limitations, failing to capture complex spatial-temporal dependencies and object details.

Why it matters:

Processing lengthy videos with standard transformers is computationally expensive and often exceeds context windows
Self-attention mechanisms frequently fail to capture long-range relations necessary for reasoning about causal events or specific object states over time
Current agent-based approaches lack video-specific designs, leading to complicated pipelines that still underperform compared to end-to-end models

Concrete Example: When asked 'What is the relationship between the boy and the adults?' about a long video, an end-to-end model might hallucinate an answer from a few sampled frames. VideoAgent instead retrieves the specific segments (9 and 13) that show the boy playing while the adults supervise, then combines these observations to infer that the adults are likely his parents.

Key Novelty

Unified Memory Mechanism for Video Agents

Constructs a dual-memory system: 'Temporal Memory' for generic event descriptions (captions) and 'Object Memory' for tracking specific object states and trajectories via a database.
Equips an LLM with a minimalist set of tools (e.g., segment localization, SQL-based object querying) to iteratively retrieve only relevant information from this structured memory rather than processing the whole video at once.

Architecture

The dual-phase pipeline: Memory Construction (left) and Inference/Tool-Use (right).

Evaluation Highlights

+26.0 accuracy points on the EgoSchema subset (long-form reasoning) over the Video-LLaVA baseline (36.8 → 62.8)
+6.6 average accuracy points on the NExT-QA subset over the SeViLA baseline (64.2 → 70.8)
Comparable to Gemini 1.5 Pro on the EgoSchema subset (62.8 vs. 63.2)

Breakthrough Assessment

8/10

Significantly closes the gap between open-source models and proprietary giants (Gemini 1.5 Pro) on challenging long-video benchmarks by using a structured memory approach rather than raw context scaling.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering and Temporal Localization on long-form videos

Inputs: Long-form video V and a natural language query q

Outputs: Textual response a (answer or time window)

Pipeline Flow

Memory Construction: Process video into Temporal Memory (captions) and Object Memory (tracking DB)
Inference Loop: LLM receives query → Selects Tool (Localization, Caption Retrieval, VQA, Object Query) → Updates Context → Repeats
Final Response: LLM synthesizes gathered information into answer
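
The three stages above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `plan_next_step` is a hypothetical stand-in for the LLM planner (a fixed rule-based policy so the sketch runs without an LLM), and the tool names and returned strings are made up for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool stubs; a real system would call retrieval models,
# a VQA model, or a database here.
def segment_localization(arg: str) -> str:
    return "segments 9 and 13 match: " + arg

def caption_retrieval(arg: str) -> str:
    return "captions for " + arg + ": a boy plays; adults watch him"

TOOLS: Dict[str, Callable[[str], str]] = {
    "segment_localization": segment_localization,
    "caption_retrieval": caption_retrieval,
}

def plan_next_step(query: str, context: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM planner (assumption): returns (action, argument)."""
    if len(context) == 0:
        return "segment_localization", query
    if len(context) == 1:
        return "caption_retrieval", "segments 9 and 13"
    return "final_answer", "answer based on: " + " | ".join(context)

def run_agent(query: str, max_steps: int = 5) -> str:
    """Select a tool, record its observation in the context, and repeat."""
    context: List[str] = []
    for _ in range(max_steps):
        action, arg = plan_next_step(query, context)
        if action == "final_answer":
            return arg
        context.append(TOOLS[action](arg))
    return "no answer within step budget"

answer = run_agent("relationship between the boy and the adults")
```

The step budget (`max_steps`) is an assumption added so the loop always terminates; the summary does not state how the original system bounds the number of tool calls.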

System Modules

Temporal Memory Constructor (Memory Construction)

Slice video into segments and generate textual descriptions and embeddings

Model or implementation: LaViLa (captioning), ViCLIP (video features), text-embedding-3-large (text features)
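
As a rough sketch of what building such a memory might look like, the code below slices a video timeline into fixed-length segments and stores a caption and a text embedding per segment. The segment length, the `SegmentEntry` fields, and the `caption_fn` / `embed_fn` callables are all assumptions for illustration; in the described system the captioner and embedders would be LaViLa, ViCLIP, and text-embedding-3-large rather than the toy stubs used here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SegmentEntry:
    segment_id: int
    start_s: float
    end_s: float
    caption: str
    text_embedding: List[float]

def build_temporal_memory(
    duration_s: float,
    segment_len_s: float,
    caption_fn: Callable[[float, float], str],
    embed_fn: Callable[[str], List[float]],
) -> List[SegmentEntry]:
    """Slice [0, duration_s) into segments and describe each one."""
    if segment_len_s <= 0:
        raise ValueError("segment_len_s must be positive")
    memory: List[SegmentEntry] = []
    seg_id = 0
    while seg_id * segment_len_s < duration_s:
        start = seg_id * segment_len_s
        end = min(start + segment_len_s, duration_s)  # last segment may be shorter
        caption = caption_fn(start, end)
        memory.append(SegmentEntry(seg_id, start, end, caption, embed_fn(caption)))
        seg_id += 1
    return memory

# Toy stand-ins for the captioning and embedding models (assumptions).
def fake_caption(start: float, end: float) -> str:
    return f"clip from {start:.0f}s to {end:.0f}s"

def fake_embed(text: str) -> List[float]:
    return [float(len(text))]

memory = build_temporal_memory(7.0, 2.0, fake_caption, fake_embed)
```

Indexing entries by `segment_id` keeps the memory addressable by time, which is what lets later tools refer to "segment 9" or "segment 13" directly.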

Object Memory Constructor (Memory Construction)

Detect, track, and re-identify objects to build a queryable database

Model or implementation: RT-DETR (detection), ByteTrack (tracking), CLIP + DINOv2 (Re-ID features)
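
To illustrate the re-identification step, the sketch below greedily groups tracks whose appearance features are similar, so that the same object seen in different tracks receives one identity. This is a simplified stand-in, not the paper's procedure: the single feature vector per track, the greedy matching rule, and the 0.9 threshold are assumptions, and toy 2-D vectors replace CLIP + DINOv2 features.

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_identities(
    track_features: Dict[int, List[float]], threshold: float = 0.9
) -> Dict[int, int]:
    """Map each track ID to an object ID, merging similar-looking tracks."""
    prototypes: List[List[float]] = []  # one reference feature per object ID
    assignment: Dict[int, int] = {}
    for track_id in sorted(track_features):
        feat = track_features[track_id]
        best_id, best_sim = -1, threshold
        for obj_id, proto in enumerate(prototypes):
            sim = cosine(feat, proto)
            if sim >= best_sim:
                best_id, best_sim = obj_id, sim
        if best_id == -1:  # nothing similar enough: treat as a new object
            prototypes.append(feat)
            best_id = len(prototypes) - 1
        assignment[track_id] = best_id
    return assignment

# Tracks 0 and 2 look alike; track 1 is a different object.
tracks = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.99, 0.05]}
identities = assign_identities(tracks)
```

With these toy features, tracks 0 and 2 share an object ID while track 1 gets its own, which is the property that makes counting questions answerable from the database.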

Planner / Reasoner

Decompose query and orchestrate tool use

Model or implementation: GPT-4

Tool: Object Memory Querying (Tools)

Retrieve object-specific information via SQL

Model or implementation: CLIP (text encoder) + SQL engine
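
A minimal sketch of SQL-based object querying, using Python's built-in `sqlite3` module. The `appearances` table, its columns, and the sample rows are hypothetical; the summary does not give the actual schema. The step that maps a free-text description to object IDs (the role of the CLIP text encoder) is omitted here.

```python
import sqlite3
from typing import List

# In-memory database with a hypothetical schema: one row per
# (object, segment) appearance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appearances (object_id INTEGER, category TEXT, segment_id INTEGER)"
)
conn.executemany(
    "INSERT INTO appearances VALUES (?, ?, ?)",
    [(0, "person", 9), (0, "person", 13), (1, "person", 13), (2, "ball", 9)],
)

def count_objects(db: sqlite3.Connection, category: str) -> int:
    """Count distinct object identities of a given category."""
    cur = db.execute(
        "SELECT COUNT(DISTINCT object_id) FROM appearances WHERE category = ?",
        (category,),
    )
    return cur.fetchone()[0]

def segments_of(db: sqlite3.Connection, object_id: int) -> List[int]:
    """List the segments in which one object appears, in time order."""
    cur = db.execute(
        "SELECT DISTINCT segment_id FROM appearances "
        "WHERE object_id = ? ORDER BY segment_id",
        (object_id,),
    )
    return [row[0] for row in cur.fetchall()]

person_count = count_objects(conn, "person")
first_person_segments = segments_of(conn, 0)
```

Counting with `COUNT(DISTINCT object_id)` only gives a correct answer when re-identification has already merged repeated sightings of the same object into one ID.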

Tool: Segment Localization (Tools)

Find relevant video segments for a text query

Model or implementation: ViCLIP (video-text similarity) + OpenAI text embeddings
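
Segment localization of this kind can be sketched as a similarity ranking. The code below scores each segment by the sum of a video-text similarity and a caption-text similarity and returns the top-k segment IDs. The equal-weight sum, the function names, and the toy 2-D vectors are assumptions for illustration; real embeddings would come from ViCLIP and the OpenAI text embedding model.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def localize(
    query_video_vec: Sequence[float],
    query_text_vec: Sequence[float],
    segment_video_vecs: List[Sequence[float]],
    segment_caption_vecs: List[Sequence[float]],
    top_k: int = 2,
) -> List[int]:
    """Return the IDs of the top_k segments most similar to the query."""
    scored = []
    for seg_id, (vid, cap) in enumerate(zip(segment_video_vecs, segment_caption_vecs)):
        # Equal-weight combination of the two similarities (assumption).
        score = cosine(query_video_vec, vid) + cosine(query_text_vec, cap)
        scored.append((seg_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [seg_id for seg_id, _ in scored[:top_k]]

segment_video_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segment_caption_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
top_segments = localize([1.0, 0.0], [1.0, 0.0], segment_video_vecs, segment_caption_vecs)
```

Because the segment embeddings are computed once during memory construction, each query only requires embedding the query text and running this ranking.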

Tool: Visual Question Answering (Tools)

Answer specific visual details on a short segment

Model or implementation: Video-LLaVA

Novel Architectural Elements

Dual-memory architecture separating 'generic event context' (Temporal Memory) from 'specific object states' (Object Memory)
SQL-based interface for LLM to query object trajectories in videos, bridging unstructured video data with structured database logic

Modeling

Base Model: GPT-4 (Planner), Video-LLaVA (VQA Tool)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Video-LLaVA: VideoAgent uses structured memory + tool use instead of end-to-end processing, handling longer videos efficiently
vs. ViperGPT: VideoAgent uses a unified memory and specialized object tracking database, whereas ViperGPT relies on on-the-fly API calls without persistent structured video representation
vs. LifeLongMemory: VideoAgent includes explicit Object Memory (tracking/Re-ID) and SQL querying, not just textual captions
vs. DoraemonGPT [not cited in paper]: DoraemonGPT uses MCTS for planning; VideoAgent uses a simpler iterative loop but a more structured Object/Temporal memory design

Limitations

Dependency on the performance of off-the-shelf foundation models (LaViLa, ByteTrack, etc.)
Object memory construction cost can be high for videos with extremely dense objects
Open-ended QA performance is slightly lower than that of GPT-4V, which has direct visual access (VideoAgent relies on intermediate tools)

Reproducibility

Code: https://videoagent.github.io

Code and demo are publicly available (https://videoagent.github.io). The system relies on proprietary APIs (GPT-4, OpenAI embeddings) and open-source models (LaViLa, Video-LLaVA, RT-DETR).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on video understanding benchmarks

Benchmarks:

EgoSchema: Long-form video QA (multiple choice)
Ego4D NLQ: Natural Language Queries (temporal localization)
WorldQA: Video QA (open-ended and multiple choice)
NExT-QA: Video QA (temporal, causal, descriptive)

Metrics:

Accuracy (Acc)
Recall@k (R1@0.3, R5@0.3)
IoU (Intersection over Union)
Statistical methodology: Not explicitly reported in the paper
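
For the temporal localization metrics, a short sketch of how temporal IoU and Recall@k at an IoU threshold are commonly computed. This follows the usual definition (a query counts as a hit if any of its top-k predicted windows overlaps the ground truth with IoU at or above the threshold); it is not taken from the paper's evaluation code, and the sample windows are invented.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_s, end_s)

def temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection over union of two time windows."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(
    ranked_preds: List[List[Window]],
    gts: List[Window],
    k: int,
    iou_threshold: float = 0.3,
) -> float:
    """Fraction of queries with a top-k prediction reaching the IoU threshold."""
    if not gts:
        return 0.0
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return hits / len(gts)

# Two queries, each with ranked predicted windows (toy data).
gts = [(10.0, 20.0), (30.0, 40.0)]
ranked_preds = [
    [(0.0, 5.0), (12.0, 22.0)],      # second prediction overlaps the ground truth
    [(100.0, 110.0), (50.0, 60.0)],  # no prediction overlaps
]
r1 = recall_at_k(ranked_preds, gts, k=1)
r5 = recall_at_k(ranked_preds, gts, k=5)
```

In this toy case the first query is only recovered when more than one prediction is considered, which is why R5@0.3 is higher than R1@0.3.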

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
EgoSchema results demonstrate VideoAgent's superiority on long-form video reasoning tasks.
EgoSchema (subset)	Accuracy	36.8	62.8	+26.0
EgoSchema (full set)	Accuracy	50.3	60.2	+9.9
NExT-QA results highlight improvements in causal and temporal reasoning.
NExT-QA (subset)	Average Accuracy	64.2	70.8	+6.6
NExT-QA (subset)	Average Accuracy	53.5	70.8	+17.3
Ablation studies on NExT-QA confirm the importance of each memory component.
NExT-QA (subset)	Average Accuracy	40.7	70.8	+30.1
NExT-QA (subset)	Average Accuracy	48.7	56.0	+7.3

Experiment Figures

Conceptual comparison between End-to-End models and VideoAgent.

A step-by-step execution trace of VideoAgent answering a question about relationships.

Main Takeaways

Structured memory (Temporal + Object) is more effective for long-form video understanding than increasing context window or simple frame sampling.
Object Re-Identification (Re-ID) significantly boosts performance on descriptive questions involving object counting and state tracking.
Minimalist toolset (Retrieve, Localize, VQA, Object Query) is sufficient to outperform complex pipelines, suggesting 'less is more' when tools are well-aligned with memory structure.
VideoAgent effectively bridges the gap between open-source models and state-of-the-art proprietary models like Gemini 1.5 Pro on complex benchmarks.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and tool-use (function calling)
Video-Language Models (e.g., CLIP, LaViLa)
Object Detection and Tracking

Key Terms

Temporal Memory: A structured storage of short video segment descriptions (captions) and their embedding features, indexed by time

Object Memory: A database tracking object occurrences, containing a feature table for visual similarity and a SQL database for querying relationships and timelines

LaViLa: A video captioning model that generates detailed textual descriptions for video clips, used here to populate Temporal Memory

Re-ID: Re-identification—the process of determining whether different tracked object instances across video frames correspond to the same unique object entity

CLIP: Contrastive Language-Image Pre-training—a model that aligns text and images in a shared embedding space, used here for feature matching

ByteTrack: A multi-object tracking algorithm used to associate detection boxes across frames

ViCLIP: A video-text retrieval model used to compute similarity between text queries and video segments

Video-LLaVA: A multimodal LLM used here as a specific tool for Visual Question Answering on short retrieved segments