MM-VID: Advancing Video Understanding with GPT-4V(ision)

📝 Paper Summary

Long-form video understanding · Multimodal Agents · Audio Description (AD) generation

MM-VID enables GPT-4V to understand hour-long videos and interactive streams by orchestrating specialized tools to transcribe visual and audio content into a coherent textual script.

Core Problem

Large Multimodal Models like GPT-4V have limited context windows, preventing them from directly processing long-form videos (hours) or maintaining narrative coherence across multiple episodes.

Why it matters:

Current video models are typically trained on short clips (e.g., 10 seconds), failing to grasp long-term temporal dependencies in movies or sports
Real-world applications like live-streaming or gaming require continuous reasoning over dynamic environments, which static clip-based models cannot handle
Accessibility tools (Audio Descriptions) for long videos are expensive and slow to produce manually

Concrete Example: In a 50-minute documentary, a standard model might identify a person in a single frame but fail to answer 'How did the protagonist's journey change over the course of the film?' because it has no long-term memory.

Key Novelty

Video-to-Script Generation Pipeline

Treats video understanding as a text generation problem by converting the entire video into a detailed screenplay (script) first
Uses GPT-4V to generate descriptions for short clips, then uses GPT-4 to stitch these snippets into a coherent long-form narrative including dialogue and action
Integrates specialized expert tools (ASR, Scene Detection) to handle specific modalities before synthesis

Architecture

Overview of the MM-VID pipeline transforming video input into a textual script for downstream tasks

Evaluation Highlights

Achieved an 8.91/10 audio quality rating from visually impaired users for generated Audio Descriptions, close to the rating for human-crafted descriptions (9.07/10)
Sighted users rated the timing/synchronization of MM-VID descriptions at 8.53/10, nearly matching human performance (8.59/10)
Demonstrated zero-shot capability in playing Super Mario Bros and navigating iPhone GUIs by processing streaming frames

Breakthrough Assessment

7/10

Strong engineering system applying GPT-4V to long contexts via tool use. Architecturally it combines existing API calls, but the 'script generation' paradigm works around the context window bottleneck for video by reducing hours of footage to text.

⚙️ Technical Details

Problem Definition

Setting: Long-form video understanding and interactive agent control

Inputs: Video file V (frames + audio) or streaming video frames

Outputs: Textual script S, Answers to QA, or Agent Actions A

Pipeline Flow

Multimodal Pre-Processing (Tools)
Clip-Level Description (GPT-4V)
Script Generation (GPT-4)
Task Execution (GPT-4 / Agent)
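
The first three stages above can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Clip`, `run_pipeline`, and the four injected callables are hypothetical, and each callable stands in for a real tool (scene detector, ASR service, GPT-4V, GPT-4). The overlap rule used to attach speech to clips is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[float, float]                # (start_sec, end_sec)
SpeechSegment = Tuple[float, float, str]  # (start_sec, end_sec, text)


@dataclass
class Clip:
    start: float
    end: float
    transcript: str = ""   # ASR text that overlaps this clip
    description: str = ""  # clip-level visual description


def run_pipeline(
    video_path: str,
    detect_scenes: Callable[[str], List[Span]],
    transcribe: Callable[[str], List[SpeechSegment]],
    describe_clip: Callable[[str, Clip], str],
    write_script: Callable[[List[Clip]], str],
) -> str:
    # Stage 1: multimodal pre-processing (scene detection + ASR).
    clips = [Clip(start, end) for start, end in detect_scenes(video_path)]
    segments = transcribe(video_path)
    for clip in clips:
        # Assumed rule: attach every speech segment that overlaps the clip.
        clip.transcript = " ".join(
            text for (s, e, text) in segments if s < clip.end and e > clip.start
        )

    # Stage 2: clip-level description (the vision model's role).
    for clip in clips:
        clip.description = describe_clip(video_path, clip)

    # Stage 3: script generation (the text-only model's role).
    return write_script(clips)
```

Stage 4 (task execution) would then take the returned script as plain-text context for question answering or audio description. Passing the tools in as callables keeps the sketch runnable with stubs and makes the decoupling between perception and reasoning explicit.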

System Modules

Scene Detector (Pre-Processing)

Segment the video into coherent temporal chunks based on visual transitions

Model or implementation: PySceneDetect
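
To make the idea of cutting on visual transitions concrete, here is a toy threshold-based splitter. It is not PySceneDetect's code or API. It assumes that a per-frame difference score has already been computed by some other means, and the function name `cuts_to_scenes`, the score scale, and the threshold are all hypothetical.

```python
from typing import List, Tuple


def cuts_to_scenes(
    diff_scores: List[float], threshold: float, fps: float
) -> List[Tuple[float, float]]:
    """Split a video into scenes wherever consecutive frames differ sharply.

    diff_scores[i] is an assumed dissimilarity between frame i and frame i + 1.
    Returns (start_sec, end_sec) pairs covering the whole video.
    """
    n_frames = len(diff_scores) + 1
    boundaries = [0]
    for i, score in enumerate(diff_scores):
        if score > threshold:
            boundaries.append(i + 1)  # a new scene starts at frame i + 1
    boundaries.append(n_frames)
    return [(a / fps, b / fps) for a, b in zip(boundaries, boundaries[1:])]
```

For example, scores `[0.1, 0.2, 0.9, 0.1, 0.1]` with a threshold of `0.5` at 1 frame per second give two scenes, one covering seconds 0 to 3 and one covering seconds 3 to 6.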

ASR Module (Pre-Processing)

Transcribe spoken dialogue to text

Model or implementation: Azure Cognitive Services API

Clip Describer

Generate detailed visual descriptions for each short video segment

Model or implementation: GPT-4V(ision)

Script Generator

Synthesize all multimodal signals into a single coherent long-form script

Model or implementation: GPT-4 (Text-only)
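
One plausible way to hand the multimodal signals to a text-only model is to interleave them by timestamp in a single prompt. The sketch below assumes that approach. The instruction wording, the `VISUAL`/`SPEECH` labels, and the function names are hypothetical and do not reproduce the paper's prompts.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def build_script_prompt(
    visual: List[Tuple[float, str]], speech: List[Tuple[float, str]]
) -> str:
    """Merge clip descriptions and ASR lines into one time-ordered prompt."""
    events = [(t, "VISUAL", text) for t, text in visual]
    events += [(t, "SPEECH", text) for t, text in speech]
    events.sort(key=lambda event: event[0])  # stable sort keeps ties in input order
    lines = [f"[{format_timestamp(t)}] {source}: {text}" for t, source, text in events]
    # Hypothetical instruction line; the real prompt is not given in this summary.
    header = "Combine the timestamped notes below into one coherent script."
    return header + "\n" + "\n".join(lines)
```

The resulting string would be sent to the text-only model, which returns the long-form script. Keeping timestamps in every line lets downstream tasks refer back to specific moments in the video.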

Novel Architectural Elements

Two-stage 'Video-to-Script' architecture: Visual perception (GPT-4V) is decoupled from long-term reasoning (GPT-4 text), bridged by a generated script
Integration of Visual Prompting (face snapshots) into the pipeline to improve character identification consistency

Modeling

Base Model: GPT-4V (vision-enabled) and GPT-4 (text-only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. VLog: MM-VID uses GPT-4V for dense visual description instead of weaker captioners (BLIP-2), enabling detailed narrative generation rather than just keyword matching
vs. Standard Video-LLMs (e.g., Video-LLaMA): Handles hour-long videos via script intermediate representation, whereas standard Video-LLMs are limited to short context windows [not cited in paper]
vs. AutoAD: MM-VID is a general-purpose system using foundational LMMs, whereas AutoAD is a specialized model trained specifically for movie description

Limitations

Hallucination: GPT-4V occasionally misidentifies objects (e.g., misclassifying a bird as a rock in blurry frames)
Latency/Cost: Processing hour-long videos frame-by-frame with GPT-4V is computationally expensive and slow
Audio Overlap: Generated audio descriptions sometimes overlap with original video dialogue
Dependency: Heavily reliant on the performance of closed-source APIs (GPT-4V)
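
The audio-overlap limitation can be checked mechanically once the timing of each generated description and each dialogue line is known. The sketch below is an illustrative check, not part of MM-VID. The interval representation and the function name `find_overlaps` are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def find_overlaps(
    ad: List[Interval], dialogue: List[Interval]
) -> List[Tuple[int, int, float]]:
    """Return (ad_index, dialogue_index, overlap_seconds) for each clashing pair."""
    overlaps = []
    for i, (a_start, a_end) in enumerate(ad):
        for j, (d_start, d_end) in enumerate(dialogue):
            shared = min(a_end, d_end) - max(a_start, d_start)
            if shared > 0:  # intervals that merely touch do not count as overlap
                overlaps.append((i, j, shared))
    return overlaps
```

A description flagged this way could be shortened or moved to the nearest gap in the dialogue. The pairwise loop is quadratic, which is acceptable for the few hundred intervals a typical video would produce.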

Reproducibility

The system relies on proprietary APIs (OpenAI GPT-4V, Azure ASR). Code is not explicitly released, though demo videos are available. Prompts are partially documented in figures.

📊 Experiments & Results

Evaluation Setup

User study comparing MM-VID generated Audio Descriptions (AD) against Human-crafted AD

Benchmarks:

Audio Description User Study (human evaluation, Likert scale 0-10) [New]

Metrics:

Effectiveness of Delivery
Informativeness
Audio Quality
Overall Satisfaction
Timing and Synchronization
Statistical methodology: Means and standard deviations reported for 9 participants (4 blind/low-vision, 5 sighted)

Key Results

Benchmark	Metric	Human-crafted AD (Baseline)	MM-VID (This Paper)	Δ (MM-VID - Human)
Evaluation by visually impaired participants (N=4) comparing MM-VID against human-crafted descriptions.
Audio Description User Study	Effectiveness of Delivery (0-10)	8.33	7.14	-1.19
Audio Description User Study	Informativeness (0-10)	9.29	7.14	-2.15
Audio Description User Study	Audio Quality (0-10)	9.07	8.91	-0.16
Evaluation by sighted participants (N=5) assessing accuracy and synchronization.
Audio Description User Study	Clarity/Accuracy (0-10)	8.90	7.83	-1.07
Audio Description User Study	Timing and Synchronization (0-10)	8.59	8.53	-0.06
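
The Δ column is the MM-VID score minus the human-crafted baseline score, so a negative value means MM-VID was rated lower. The short check below recomputes each Δ from the scores listed in the table. It uses `Decimal` to avoid floating-point rounding noise, and the variable names are illustrative only.

```python
from decimal import Decimal

# (metric, human-crafted AD score, MM-VID score), copied from the table above.
ROWS = [
    ("Effectiveness of Delivery", "8.33", "7.14"),
    ("Informativeness", "9.29", "7.14"),
    ("Audio Quality", "9.07", "8.91"),
    ("Clarity/Accuracy", "8.90", "7.83"),
    ("Timing and Synchronization", "8.59", "8.53"),
]


def delta(human: str, mm_vid: str) -> Decimal:
    """Signed gap between MM-VID and the human baseline on a 0-10 scale."""
    return Decimal(mm_vid) - Decimal(human)


DELTAS = {metric: delta(human, mm_vid) for metric, human, mm_vid in ROWS}
```

The gaps are smallest for Audio Quality and Timing and Synchronization, and largest for Informativeness.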

Experiment Figures

Bar chart comparing Human-written AD vs. MM-VID AD ratings from visually impaired participants

Main Takeaways

MM-VID generates audio descriptions that are comparable to human-crafted ones in terms of audio quality and synchronization.
Visually impaired users rated overall satisfaction lower primarily due to audio overlaps (descriptions talking over dialogue), not content quality.
The system demonstrates capability in 'Multi-Video Episodic Analysis', successfully tracking character storylines across multiple video files.
Visual prompting (face photos) significantly improves character identification and speaker attribution in the generated scripts.

📚 Prerequisite Knowledge

Prerequisites

Large Multimodal Models (GPT-4V)
Automatic Speech Recognition (ASR)
Prompt Engineering

Key Terms

GPT-4V: GPT-4 with Vision capabilities, a Large Multimodal Model (LMM) capable of processing image and text inputs

ASR: Automatic Speech Recognition—technology that converts spoken language into text

Audio Description (AD): Narrated descriptions of a video's visual elements to make content accessible to blind or low-vision audiences

Scene Detection: Algorithms that identify transition points between different shots or scenes in a video

GUI: Graphical User Interface—visual interface (icons, menus) humans use to interact with computers

Visual Prompting: Providing visual cues (e.g., a photo of a character's face) alongside the query to help the model identify specific entities

Likert scale: A rating scale used in questionnaires (e.g., 0 to 10) to measure opinions or perceptions