Zero-Shot Video Question Answering with Procedural Programs

📝 Paper Summary

Modular Visual Question Answering, Code Generation for Vision, Neuro-symbolic AI

ProViQ answers zero-shot video questions by using an LLM to generate Python programs that invoke a library of pre-trained visual modules (detection, tracking, captioning) to reason procedurally.

Core Problem

Existing video QA methods rely on end-to-end training on limited datasets, so they struggle to generalize to unseen questions (the zero-shot setting) and to handle multi-step procedural reasoning.

Why it matters:

Current supervised models fail to generalize outside their training distributions (e.g., from Kinetics to NeXT-QA)
End-to-end black-box models lack interpretability, making it hard to diagnose whether errors come from perception or reasoning
Humans solve video queries procedurally (find frame -> find object -> check attribute), but standard models cannot explicitly execute these discrete steps

Concrete Example: Question: 'What color jacket did the skier in orange pants wear?' An end-to-end model might guess based on the most common skier jacket color. ProViQ generates code to: 1) filter frames for skiers, 2) find the skier with orange pants, 3) crop that skier, and 4) query the jacket color.
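
The four steps in this example can be sketched as a small program. This is only an illustration: the `Box` and `Frame` classes, the toy attribute dictionaries, and the signatures of `filter_object`, `find`, and `query` are assumptions made for the sketch, not the paper's actual API or a program generated by ProViQ.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A detected object crop; toy attributes stand in for pixels."""
    label: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Frame:
    index: int
    boxes: list


# Hypothetical stand-ins for the video API (signatures are assumptions).
def filter_object(frames, name):
    """Keep frames that contain at least one detection of `name`."""
    return [f for f in frames if any(b.label == name for b in f.boxes)]


def find(frame, name):
    """Return all crops labelled `name` in one frame."""
    return [b for b in frame.boxes if b.label == name]


def query(box, attribute):
    """Toy lookup; a real module would run a VQA model on the crop."""
    return box.attributes.get(attribute)


def answer_question(frames):
    """The kind of program an LLM might write for the skier question."""
    for frame in filter_object(frames, "skier"):         # 1) frames with skiers
        for skier in find(frame, "skier"):               # 2) each skier crop
            if query(skier, "pants color") == "orange":  # 3) the right skier
                return query(skier, "jacket color")      # 4) the answer
    return "unknown"


video = [
    Frame(0, [Box("tree")]),
    Frame(1, [Box("skier", {"pants color": "black", "jacket color": "red"}),
              Box("skier", {"pants color": "orange", "jacket color": "blue"})]),
]
print(answer_question(video))  # blue
```

Because each step is an explicit call, a wrong answer can be traced to the step that produced it.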

Key Novelty

Procedural Video Querying (ProViQ)

Extends visual programming (like ViperGPT) to the video domain by introducing a video-specific API (tracking, transcriptions, temporal filtering)
Treats video QA as code generation: an LLM writes a Python script that calls pre-trained vision tools to solve the question step-by-step
Uses 'in-context learning' with example programs in the prompt to teach the LLM how to use the custom video API without any model training
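
As a rough sketch of how such a prompt might be assembled, the snippet below concatenates API documentation, a few worked examples, and the new question. The prompt layout, the `build_prompt` helper, and the example program are assumptions for illustration; the paper's actual prompt text is not reproduced here.

```python
# Hypothetical, abbreviated API documentation shown to the LLM.
API_DOC = '''\
def filter_object(frames, name): ...
def video_query(frames, question): ...
'''

# Hypothetical in-context examples: (question, program) pairs.
EXAMPLES = [
    ("What is the man holding?",
     "frames = filter_object(video, 'man')\n"
     "answer = video_query(frames, 'What is the man holding?')"),
]


def build_prompt(question, api_doc=API_DOC, examples=EXAMPLES, k=1):
    """Join API docs, the first k examples, and the new question."""
    parts = ["# Available API", api_doc.rstrip()]
    for example_question, program in examples[:k]:
        parts += [f"# Question: {example_question}", program]
    parts += [f"# Question: {question}", "# Program:"]
    return "\n\n".join(parts)


print(build_prompt("What color is the car?"))
```

Setting `k=0` drops the examples entirely, which is one way to reproduce a no-example ablation under these assumptions.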

Architecture

The ProViQ pipeline: Prompt Construction -> LLM Code Generation -> Program Execution -> Answer.

Evaluation Highlights

+16.4 accuracy points on the open-ended ActivityNet-QA benchmark (25.9 → 42.3) compared to previous zero-shot methods
+25.0 accuracy points on the challenging long-form EgoSchema benchmark (32.1 → 57.1) over prior state-of-the-art
Achieves state-of-the-art zero-shot performance across 7 different video QA benchmarks, including open-ended, multiple-choice, and multimodal datasets

Breakthrough Assessment

8/10

Significant leap in zero-shot performance (up to +25 accuracy points) from adapting code-generation techniques to video. Demonstrates that modular, training-free approaches can sometimes outperform supervised baselines on complex reasoning tasks.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot Video Question Answering (VideoQA)

Inputs: A video V and a natural-language question Q

Outputs: A text answer A (open-ended) or a selection from multiple choices

Pipeline Flow

Prompt Construction (API docs + Examples + Question)
Code Generation (LLM generates Python program)
Program Execution (Python interpreter runs code)
Answer Refinement (Map output to vocabulary)
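
The four stages above can be sketched end to end with stand-ins. Everything here is an assumption for illustration: `fake_llm` replaces the real code-generating model, the two API functions are toy versions operating on dictionaries, and the final step is only a placeholder for vocabulary mapping. Note that `exec` on model-generated code is unsafe outside a sandbox.

```python
def fake_llm(prompt):
    """Stand-in for the code-generating LLM; returns a fixed program."""
    return (
        "def execute(video):\n"
        "    frames = filter_property(video, 'dog')\n"
        "    return video_query(frames, 'color')\n"
    )


def filter_property(video, keyword):
    """Toy module: keep frames whose caption mentions `keyword`."""
    return [f for f in video if keyword in f["caption"]]


def video_query(frames, key):
    """Toy module: read a stored attribute from the first frame."""
    return frames[0][key] if frames else "unknown"


def run_pipeline(question, video, llm=fake_llm):
    prompt = f"# Question: {question}\n# Program:"   # 1) prompt construction
    source = llm(prompt)                             # 2) code generation
    namespace = {"filter_property": filter_property,
                 "video_query": video_query}
    try:
        exec(source, namespace)                      # 3) program execution
        raw = namespace["execute"](video)
    except Exception:
        raw = "unknown"                              # failed programs fall back
    return str(raw).strip().lower()                  # 4) placeholder refinement


video = [{"caption": "a cat sleeps", "color": "Grey"},
         {"caption": "a dog runs", "color": "Brown "}]
print(run_pipeline("What color is the dog?", video))  # brown
```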

System Modules

Code Generator

Converts natural language question into an executable Python script using the provided API

Model or implementation: GPT-3.5-turbo

Visual Module: filter_property (Video API)

Finds all frames in a video that satisfy a boolean predicate (e.g., 'Is the person running?')

Model or implementation: BLIP-2
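
A minimal sketch of this kind of predicate filter, assuming the VQA model is exposed as a callable that returns a free-text answer. The `vqa` parameter, the yes/no parsing rule, and the toy model are assumptions; the paper's module wraps BLIP-2 and may differ in its details.

```python
def filter_property(frames, question, vqa):
    """Keep frames for which the VQA callable answers 'yes' to `question`."""
    kept = []
    for frame in frames:
        reply = vqa(frame, question).strip().lower()
        if reply.startswith("yes"):
            kept.append(frame)
    return kept


def toy_vqa(frame, question):
    """Toy model: frames are dicts carrying a ground-truth action label."""
    return "Yes" if frame["action"] in question else "no"


frames = [{"t": 0, "action": "walking"},
          {"t": 1, "action": "running"},
          {"t": 2, "action": "running"}]
running = filter_property(frames, "Is the person running?", toy_vqa)
print([f["t"] for f in running])  # [1, 2]
```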

Visual Module: filter_object / find (Video API)

Detects specific objects to filter frames or return bounding box crops

Model or implementation: GroundingDINO

Visual Module: video_query (Video API)

Answers a question about a collection of frames using frame-wise voting

Model or implementation: BLIP-2
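
Frame-wise voting can be sketched as a majority vote over per-frame answers. The `vqa` callable, the lowercase normalization, and the "unknown" fallback are assumptions made for this illustration.

```python
from collections import Counter


def video_query(frames, question, vqa):
    """Ask `question` of each frame and return the most common answer."""
    answers = [vqa(frame, question).strip().lower() for frame in frames]
    if not answers:
        return "unknown"
    # Counter.most_common lists equal counts in first-seen order.
    return Counter(answers).most_common(1)[0][0]


# Toy setup: each "frame" is already the answer a model would give for it.
per_frame_answers = ["red", "Red ", "blue"]
print(video_query(per_frame_answers, "What color is the car?",
                  lambda frame, q: frame))  # red
```

Voting makes the result robust to a few frames where the per-frame model is wrong, at the cost of ignoring temporal order.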

Visual Module: get_summary (Video API)

Generates a paragraph summary of the video narrative

Model or implementation: LaViLa (video-to-text) + LLM aggregator
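
The two-level structure (caption short clips, then aggregate the captions) might look like the sketch below. The clip length, the `caption_clip` and `aggregate` callables, and the toy implementations are assumptions; in the paper these roles are played by LaViLa and an LLM.

```python
def chunk(frames, size):
    """Split a frame list into consecutive clips of at most `size` frames."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def get_summary(frames, caption_clip, aggregate, clip_len=4):
    """Caption each clip, then merge the captions into one summary."""
    captions = [caption_clip(clip) for clip in chunk(frames, clip_len)]
    return aggregate(captions)


def toy_captioner(clip):
    """Stand-in for a video captioning model."""
    return f"frames {clip[0]}-{clip[-1]}"


def toy_aggregator(captions):
    """Stand-in for an LLM that rewrites captions as a paragraph."""
    return "; ".join(captions)


summary = get_summary(list(range(10)), toy_captioner, toy_aggregator)
print(summary)  # frames 0-3; frames 4-7; frames 8-9
```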

Visual Module: track_objects (Video API)

Associates detections over continuous frames into tracks

Model or implementation: ByteTrack
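
To illustrate what associating detections into tracks means, here is a simple greedy IoU tracker. It is deliberately not ByteTrack, which is more involved (for instance, it uses motion prediction and also matches low-confidence detections); the box format, threshold, and function names are assumptions for this sketch.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_objects(detections_per_frame, threshold=0.3):
    """Greedy tracker: returns tracks as lists of (frame_index, box)."""
    tracks = []
    for t, boxes in enumerate(detections_per_frame):
        unmatched = list(boxes)
        for track in tracks:
            last_t, last_box = track[-1]
            if last_t != t - 1 or not unmatched:
                continue  # track ended earlier, or nothing left to match
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= threshold:
                track.append((t, best))
                unmatched.remove(best)
        # Every detection left over starts a new track.
        tracks.extend([(t, b)] for b in unmatched)
    return tracks


detections = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (51, 50, 61, 60)],
    [(2, 2, 12, 12)],
]
tracks = track_objects(detections)
print([len(track) for track in tracks])  # [3, 2]
```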

Answer Refinement

Maps the program's raw string output to the nearest valid answer in the dataset vocabulary

Model or implementation: FastText (embeddings)
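
Nearest-answer mapping can be sketched as a cosine-similarity search over the vocabulary. To keep the example self-contained, the `embed` function below is a toy character n-gram counter standing in for learned FastText vectors; it is an assumption of this sketch, not the paper's embedding.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding: character n-gram counts (stand-in for real vectors)."""
    padded = f"<{text.strip().lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def refine_answer(raw, vocabulary):
    """Return the vocabulary entry most similar to the raw program output."""
    target = embed(raw)
    return max(vocabulary, key=lambda word: cosine(target, embed(word)))


vocabulary = ["guitar", "piano", "dog"]
print(refine_answer("Guitars", vocabulary))  # guitar
```

This step matters for open-ended benchmarks scored by exact match, where a correct but differently worded output would otherwise count as wrong.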

Novel Architectural Elements

Extension of ViperGPT's image-centric API to include temporal video modules (track_objects, get_summary, get_script)
Hierarchical summarization module (get_summary) combining dense video captioning (LaViLa) with LLM aggregation for long-form narrative understanding

Modeling

Base Model: GPT-3.5-turbo (Code Generation), BLIP-2 (VQA), GroundingDINO (Detection)

Compute: Single Nvidia A100 GPU for inference (evaluation split over multiple GPUs for speed)

Comparison to Prior Work

vs. ViperGPT: ProViQ adds temporal modules (tracking, summarization, subtitles) enabling video reasoning, whereas ViperGPT processes frames independently.
vs. FrozenBiLM/Just Ask: ProViQ requires NO training/fine-tuning on video-text pairs; it uses off-the-shelf models coordinated by code.
vs. InternVideo: ProViQ generates explicit, interpretable programs rather than using a black-box embedding approach.
vs. VideoChat [not cited in paper]: ProViQ uses programmatic reasoning with discrete tools, whereas VideoChat uses a conversational LLM with soft visual prompts.

Limitations

Dependency on the quality of underlying visual modules (e.g., failure in object detection propagates to reasoning)
Vulnerable to annotation errors in datasets (modules might be correct but labels wrong/ambiguous)
Code generation errors (LLM may hallucinate methods or write non-compilable code, though mitigated by in-context examples)
Inference latency is likely higher than end-to-end models due to sequential module execution (implied by method, not explicitly quantified)
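
One way to catch the code-generation errors listed above before execution is a static check of the generated program. The sketch below is not described in the paper; it is an assumed safeguard that uses Python's `ast` module to reject programs that fail to parse or that call functions outside an allow-list built from the module names mentioned in this summary.

```python
import ast

# API names taken from this summary, plus a few builtins (an assumption).
ALLOWED = {"filter_property", "filter_object", "find", "video_query",
           "get_summary", "get_script", "track_objects",
           "len", "range", "max", "min"}


def check_program(source, allowed=ALLOWED):
    """Return a list of problems; an empty list means the check passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    problems = []
    for node in ast.walk(tree):
        # Only plain-name calls are checked; method calls are ignored.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in allowed:
                problems.append(f"unknown function: {node.func.id}")
    return problems


good = "frames = filter_object(video, 'dog')\nanswer = video_query(frames, 'color?')"
print(check_program(good))                           # []
print(check_program("x = detect_emotion(video)"))    # ['unknown function: detect_emotion']
```

A program that defines its own helper functions would be flagged by this simple version, so a fuller check would also collect locally defined names.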

Reproducibility

Code: https://github.com/rohan-choudhury/proviq

The code is publicly available. The paper uses the closed-source GPT-3.5-turbo API and relies on pre-trained checkpoints for the visual modules (BLIP-2, GroundingDINO, LaViLa, ByteTrack, Whisper).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on diverse video QA benchmarks (no training on target datasets).

Benchmarks:

TGIF-QA: open-ended QA (GIFs)
MSRVTT-QA: open-ended QA (web videos)
MSVD-QA: open-ended QA (web videos)
ActivityNet-QA: open-ended QA (long videos)
iVQA: open-ended QA (instructional)
TVQA: multiple-choice QA (multimodal, TV shows)
EgoSchema: multiple-choice QA (long-form egocentric)
NeXT-QA: multiple-choice QA (causal/temporal)

Metrics:

Top-1 Accuracy
Statistical methodology: Not explicitly reported in the paper
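
For reference, top-1 accuracy is the percentage of questions whose single predicted answer matches the label. The sketch below assumes case-insensitive exact matching after trimming whitespace; the benchmarks' official scoring scripts may normalize answers differently.

```python
def top1_accuracy(predictions, labels):
    """Percentage of predictions that exactly match their label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p.strip().lower() == y.strip().lower()
                  for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


predictions = ["red", "Dog", "two", "piano"]
labels = ["red", "dog", "three", "piano"]
print(top1_accuracy(predictions, labels))  # 75.0
```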

Key Results

Open-ended benchmarks: ProViQ achieves state-of-the-art results on open-ended VideoQA benchmarks, significantly outperforming previous zero-shot methods and sometimes supervised methods.

Benchmark	Metric	Baseline	This Paper	Δ
ActivityNet-QA	Accuracy	25.9	42.3	+16.4
TGIF-QA	Accuracy	41.9	66.1	+24.2
iVQA	Accuracy	26.8	50.7	+23.9

Long-form and multiple-choice benchmarks: ProViQ demonstrates strong performance on long-form and multiple-choice benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
EgoSchema	Accuracy	32.1	57.1	+25.0
NeXT-QA	Accuracy	60.0	63.8	+3.8
TVQA	Accuracy	59.7	64.6	+4.9

Experiment Figures

Impact of in-context examples on accuracy across different datasets.

Error analysis breakdown (Program vs. Module vs. Label) across datasets.

Main Takeaways

Procedural reasoning (ProViQ) outperforms end-to-end zero-shot models by large margins (up to 25 accuracy points, on EgoSchema), especially on datasets requiring explicit steps (find object -> query attribute).
Performance gains track dataset label quality: datasets with ambiguous labels (MSRVTT-QA, MSVD-QA) show smaller gains than higher-quality ones (iVQA, TGIF-QA).
Including in-context examples in the prompt is critical; accuracy improves drastically with just 1 example and plateaus around 3-4.
The method generalizes to diverse tasks beyond QA, such as query-based multi-object tracking and video editing, without architectural changes.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting
Visual Question Answering (VQA)
Object Detection and Tracking
Zero-shot learning

Key Terms

Zero-shot: The ability of a model to perform a task without having been explicitly trained on examples of that specific task

API: Application Programming Interface; here, a set of defined functions (tools) the LLM can call in its generated code

BLIP-2: A vision-language model used here for image captioning and visual question answering on specific frames

GroundingDINO: A text-conditioned object detection model that finds bounding boxes for objects described by text

ByteTrack: A multi-object tracking algorithm that associates detected objects across video frames to maintain identity over time

LaViLa: A video-language model specialized in long-form video understanding and narrations (used here for summarization)

In-context learning: Providing the LLM with example inputs and outputs (e.g., example questions and their corresponding code) in the prompt to guide its generation

FastText: A library for efficient text classification and representation learning, used here to map open-ended outputs to fixed vocabularies