RAVEN: An Agentic Framework for Multimodal Entity Discovery from Large-Scale Video Collections

📝 Paper Summary

Video Understanding, Multimodal Information Extraction, Agentic AI

RAVEN is a model-agnostic agent that dynamically generates domain-specific schemas from video collections to guide vision-language models in extracting structured, multimodal entities.

Core Problem

Current multimodal models typically process each video in isolation, so they lack the collection-wide understanding and domain-specific structure needed for large-scale retrieval.

Why it matters:

Video collections in education or entertainment require consistent structured metadata (entities, attributes), which per-video processing in isolation fails to provide
Existing methods lack mechanisms to dynamically define what entities matter for a specific domain (e.g., 'Ingredients' in cooking vs. 'Dates' in history)
Unimodal baselines (OCR, Speech-to-Text) miss context that requires synthesizing visual, audio, and textual cues

Concrete Example: In a historical documentary, a standard object detector might label a person as 'Person', while RAVEN, utilizing a history-specific schema, extracts 'Person: Napoleon', 'Role: Emperor', and 'Event: Battle of Waterloo' by synthesizing speech and visual context.

Key Novelty

Two-stage Agentic Schema Generation & Extraction

First, an agent scans the collection to discover categories and generates a 'schema' (a template of expected entities and attributes) using an LLM
Second, this dynamic schema is used to prompt a Vision-Language Model (VLM) for the actual extraction, ensuring the model looks for domain-relevant details rather than generic labels

Architecture

The RAVEN inference pipeline showing the two-stage process: Category Understanding and Rich Entity Extraction

Evaluation Highlights

Scaled successfully to 1.5 million video clips (>5000 hours) from the Aligned Video Captions dataset for category and schema discovery
Qualitatively outperformed unimodal baselines (OCR, Speech NER, YOLO) at extracting rich attributes and relationships (e.g., Person → Role) on a 300-clip benchmark
Demonstrated ability to dynamically generate distinct schemas for diverse domains like 'How-To' (Ingredients, Tools) vs. 'History' (Figures, Dates) without manual rules

Breakthrough Assessment

7/10

Proposes a practical, scalable agentic workflow for structuring massive video datasets. While it relies on off-the-shelf models, the dynamic schema generation approach effectively bridges the gap between generalist VLMs and domain-specific retrieval needs.

⚙️ Technical Details

Problem Definition

Setting: Multimodal entity extraction from large-scale video collections

Inputs: A collection of raw video clips (visual + audio)

Outputs: Structured JSON containing canonical categories, entities, and attributes per video
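
To make the output format concrete, here is an illustrative record built from the Napoleon example above. The field names (`video_id`, `category`, `entities`, `attributes`) and the clip identifier are assumptions for illustration only; the paper does not publish its exact JSON layout.

```python
import json

# Hypothetical per-video output record. The layout is an assumption,
# not the paper's published format.
record = {
    "video_id": "clip_0001",  # made-up identifier
    "category": "History",    # canonical category assigned to the clip
    "entities": [
        {"type": "Person", "name": "Napoleon", "attributes": {"role": "Emperor"}},
        {"type": "Event", "name": "Battle of Waterloo", "attributes": {}},
    ],
}

# Serialize to JSON, as the pipeline output is described as structured JSON.
serialized = json.dumps(record, indent=2)
```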

Pipeline Flow

Category Discovery: VLM → Raw Categories
Schema Generation: LLM → Canonical Categories + Domain Schemas
Entity Extraction: Video + Retrieved Schema → VLM → Structured Entities
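
The three steps above can be sketched as a single driver function. This is a minimal sketch, not the authors' implementation: `infer_categories`, `generate_schemas`, `pick_schema`, and `extract_entities` are hypothetical callables standing in for the VLM, LLM, and retriever calls, and the `Schema` type is an assumed representation.

```python
from typing import Callable, Dict, List

# Assumed representation: entity type -> list of attribute names.
Schema = Dict[str, List[str]]


def run_pipeline(
    videos: List[str],
    infer_categories: Callable[[str], List[str]],                # stand-in for the VLM (category discovery)
    generate_schemas: Callable[[List[str]], Dict[str, Schema]],  # stand-in for the LLM (schema generation)
    pick_schema: Callable[[str, Dict[str, Schema]], str],        # stand-in for the schema retriever
    extract_entities: Callable[[str, Schema], dict],             # stand-in for the VLM (entity extraction)
) -> Dict[str, dict]:
    # Step 1: collect raw category labels for every video in the collection.
    raw = {video: infer_categories(video) for video in videos}

    # Step 2: one collection-level pass turns all raw labels into
    # canonical categories, each with its own schema.
    all_labels = sorted({label for labels in raw.values() for label in labels})
    schemas = generate_schemas(all_labels)

    # Step 3: per-video extraction, guided by the schema chosen for that video.
    results: Dict[str, dict] = {}
    for video in videos:
        category = pick_schema(video, schemas)
        results[video] = {
            "category": category,
            "entities": extract_entities(video, schemas[category]),
        }
    return results
```

Passing the models in as callables mirrors the model-agnostic design described in this summary: any VLM or LLM client can be swapped in without changing the control flow.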

System Modules

Category Inferrer

Process raw video/audio to suggest initial content categories and generic entities

Model or implementation: Gemini 1.5 Flash

Schema Generator

Normalize raw categories into a canonical list and generate a specific entity schema for each

Model or implementation: GPT-4o
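
The input and output of the normalization step can be illustrated with a much simpler stand-in. In the paper this step is performed by an LLM; the rule-based function below is not that method. It only shows what "many raw labels in, one duplicate-free canonical list out" looks like. The `aliases` table is a hypothetical input.

```python
import re
from typing import Dict, Iterable, List


def canonicalize(raw_labels: Iterable[str], aliases: Dict[str, str]) -> List[str]:
    """Map raw category labels to a duplicate-free canonical list.

    Rule-based stand-in for the LLM step: lowercase, collapse punctuation
    to spaces, apply an optional alias table, then de-duplicate while
    preserving first-seen order.
    """
    seen = set()
    canonical: List[str] = []
    for label in raw_labels:
        key = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
        name = aliases.get(key, key.title())
        if name not in seen:
            seen.add(name)
            canonical.append(name)
    return canonical
```

For example, `canonicalize(["How-To", "how to", "HOWTO", "History "], {"howto": "How To"})` yields `["How To", "History"]`.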

Schema Retriever (Rich Extraction)

Select the appropriate schema for a video based on semantic similarity

Model or implementation: Not explicitly specified (likely vector similarity or LLM selection)

Rich Entity Extractor (Rich Extraction)

Extract detailed entities and attributes strictly following the retrieved schema

Model or implementation: Gemini 1.5 Flash
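
One way to read "strictly following the retrieved schema" is to put the schema into the prompt and then filter the model's response down to schema-conformant fields. The paper provides no prompt templates, so the prompt wording, the function names, and the assumed response shape (a JSON object mapping entity types to lists of attribute dictionaries) are all hypothetical.

```python
import json
from typing import Dict, List

# Assumed representation: entity type -> allowed attribute names.
Schema = Dict[str, List[str]]


def build_prompt(category: str, schema: Schema) -> str:
    # Hypothetical prompt wording; not taken from the paper.
    lines = [
        f"Video category: {category}",
        "Extract only these entity types and attributes as JSON:",
    ]
    for entity_type, attributes in schema.items():
        lines.append(f"- {entity_type}: {', '.join(attributes)}")
    return "\n".join(lines)


def enforce_schema(model_output: str, schema: Schema) -> Dict[str, List[dict]]:
    """Drop entity types and attributes that the schema does not define."""
    parsed = json.loads(model_output)
    cleaned: Dict[str, List[dict]] = {}
    for entity_type, attributes in schema.items():
        cleaned[entity_type] = [
            {key: value for key, value in item.items() if key in attributes}
            for item in parsed.get(entity_type, [])
        ]
    return cleaned
```

Filtering after the call is a defensive choice: even a well-prompted model can return extra fields, and downstream retrieval benefits from every record having the same shape.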

Novel Architectural Elements

Dynamic Schema Generation Loop: The system defines its own extraction targets (schemas) based on the data distribution before performing the final extraction
Model-Agnostic Agent Design: Decouples the VLM (video perception) from the LLM (schema logic), allowing distinct models for perception vs. reasoning

Modeling

Base Model: Gemini 1.5 Flash (VLM) and GPT-4o (LLM)

Compute: Not reported in the paper (inference-only framework using API-based models)

Comparison to Prior Work

vs. Unimodal Baselines (NER, OCR): RAVEN synthesizes audio, visual, and textual cues, whereas each baseline sees only one modality
vs. YOLO/Object Detection: RAVEN extracts attributes and relationships (context), whereas YOLO provides only bounding boxes with generic class labels
vs. Standard Multimodal VLMs: RAVEN uses a collection-level schema to enforce structure, whereas standard VLM prompting yields unstructured or generic outputs

Limitations

Performance depends entirely on the capabilities of the underlying off-the-shelf models (Gemini/GPT-4o)
No quantitative error analysis or specific metric scores (precision/recall) provided in the text
Latency and cost implications of two-pass processing (Discovery + Extraction) are not discussed

Reproducibility

The framework uses closed-source API models (Gemini 1.5 Flash, GPT-4o). No code repository or specific prompt templates are provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Qualitative and distribution analysis on large-scale video data

Benchmarks:

Aligned Video Captions Dataset (Video Entity Extraction)

Metrics:

Qualitative Recall (visualized, not tabulated)
Entity Attribute Richness

Statistical methodology: Not explicitly reported in the paper

Experiment Figures

A comparison of entity extraction performance between RAVEN and baselines (NER, OCR, YOLO, Captioning)

Main Takeaways

Multimodal synthesis is critical: Unimodal baselines (OCR, Speech NER) fail to capture context that spans modalities (e.g., a person is identified by face, but their role is stated only in speech)
Dynamic schemas enable domain specificity: The framework successfully generated distinct entity types for 'History' videos (e.g., Historical Figure) versus 'How-To' videos (e.g., Tools) without manual hard-coding
Qualitative superiority: The paper asserts significantly improved recall over baselines (visualized in Figure 5, though exact numbers are not given in the text), particularly for attributes and relationships
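
The paper does not state how recall was computed. For reference only, the standard set-based definition is the fraction of gold-standard entities that the system recovered. The sketch below assumes entities are compared as exact `(type, value)` pairs, which is an assumption and not the paper's protocol; the toy data reuses the Napoleon example and is not a reported result.

```python
from typing import Set, Tuple

# Assumed comparison unit: (entity type, value), matched exactly.
Entity = Tuple[str, str]


def recall(predicted: Set[Entity], gold: Set[Entity]) -> float:
    """Fraction of gold entities that appear in the predictions."""
    if not gold:
        return 0.0
    return len(predicted & gold) / len(gold)


# Toy illustration: one of the two gold entities is recovered.
gold = {("Person", "Napoleon"), ("Role", "Emperor")}
predicted = {("Person", "Napoleon")}
score = recall(predicted, gold)
```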

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Vision-Language Models (VLMs)
Basic knowledge of Named Entity Recognition (NER)
Familiarity with retrieval-augmented generation (RAG) concepts

Key Terms

VLM: Vision-Language Model—an AI model capable of processing both images/video and text to understand visual content

LLM: Large Language Model—an AI model trained on vast text data to generate and understand human language

Schema: A structured template defining what types of entities (e.g., 'Ingredient') and attributes (e.g., 'Quantity') to extract for a specific category

Canonicalization: The process of normalizing diverse raw category names into a standardized, duplicate-free list

Agentic Framework: A system where AI models act as autonomous agents, planning steps (like first defining a schema, then using it) to achieve a complex goal

NER: Named Entity Recognition—identifying specific items like names, dates, and locations in text or speech

OCR: Optical Character Recognition—converting text shown visually in images/video frames into machine-readable text