ViperGPT: Visual Inference via Python Execution for Reasoning

📝 Paper Summary

Neuro-symbolic AI, Visual Question Answering (VQA), Code Generation for Reasoning

ViperGPT solves visual tasks by prompting a code-generation LLM to write Python programs that orchestrate pre-trained vision models via a defined API, enabling zero-shot reasoning without training.

Core Problem

End-to-end vision models struggle with compositional reasoning and math, while prior modular networks required difficult joint training of program generators and modules, limiting generalization.

Why it matters:

End-to-end 'black box' models are uninterpretable and cannot reliably perform simple mathematical operations (e.g., division) or logic steps necessary for complex queries
Training modular systems from scratch is unstable and data-hungry; leveraging existing strong pre-trained models without retraining allows for immediate adaptation to new tasks

Concrete Example: Query: 'How many muffins can each kid eat for it to be fair?' End-to-end models fail to count and divide accurately. ViperGPT generates Python code that detects muffins (8) and kids (2) using an API, then executes `8 // 2` to return '4'.
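A minimal sketch of the kind of program described in this example. The names `ImagePatch`, `find`, and `execute_command` are used here for illustration; the class below is a hypothetical stub whose detections are hard-coded (8 muffins, 2 kids) so the example runs without any vision model. In the real system, `find` would be backed by an object detector.

```python
from dataclasses import dataclass, field


@dataclass
class ImagePatch:
    """Hypothetical stand-in for an image-patch class.

    Detections are hard-coded so the sketch runs without a vision model.
    """
    detections: dict = field(default_factory=dict)

    def find(self, object_name: str) -> list:
        # A real implementation would query an object detector here.
        count = self.detections.get(object_name, 0)
        return [ImagePatch() for _ in range(count)]


def execute_command(image_patch: ImagePatch) -> str:
    """Example of a program a code LLM might generate for the muffin query."""
    muffin_patches = image_patch.find("muffin")
    kid_patches = image_patch.find("kid")
    if not kid_patches:
        # Guard against division by zero when no kids are detected.
        return "0"
    # Counting and integer division are done by plain Python, not a network.
    return str(len(muffin_patches) // len(kid_patches))


scene = ImagePatch(detections={"muffin": 8, "kid": 2})
answer = execute_command(scene)
print(answer)  # 4
```

The perception step only has to count objects; the arithmetic is delegated to the interpreter, which is the division of labor the example is meant to illustrate.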

Key Novelty

Visual Inference via Python Execution

Replaces the learned neural program generator with a pre-trained code LLM (Codex) that translates natural language queries into executable Python code
Uses the standard Python interpreter as the 'reasoning engine' (for logic, math, control flow) and pre-trained vision models as the 'sensory engine' (for perception), connected via a simple class-based API

Architecture

The conceptual framework of ViperGPT, showing the flow from query to code to execution

Evaluation Highlights

Achieves 72.0% accuracy on RefCOCO visual grounding (zero-shot), outperforming the GLIP baseline by 17.0 percentage points
Surpasses the 80-billion-parameter Flamingo model on OK-VQA (external-knowledge VQA) with 51.9% accuracy despite being zero-shot
Attains state-of-the-art results on NExT-QA video reasoning (60.0%), outperforming supervised baselines on hard temporal/causal splits

Breakthrough Assessment

9/10

Demonstrates a highly effective, training-free paradigm shift where LLMs act as controllers for vision tools via code. The performance gaps over supervised or larger end-to-end models are significant.

⚙️ Technical Details

Problem Definition

Setting: Open-ended visual query answering (images and videos)

Inputs: Visual input x (image/video) and textual query q

Outputs: Result r (text, bounding box, or classification), obtained by generating a program z = π(q) from the query and executing it on the visual input, r = φ(x, z)

Pipeline Flow

Input Query + API Spec -> Code LLM (Codex)
Code LLM -> Python Program Generation
Python Interpreter executes Program
Program calls Perception API (GLIP, BLIP-2, etc.) on Input Image
Perception Modules return intermediate results
Python Interpreter aggregates results -> Final Answer
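The flow above can be sketched end to end in a few lines. Everything here is an assumption made for illustration: `fake_code_llm` is a hypothetical stand-in that returns a fixed program instead of calling Codex, `find` is a stub that reads pre-stored detections, and the image is a plain dictionary. Running generated code with `exec` is also unsafe without sandboxing, so this is a sketch of the control flow only.

```python
def fake_code_llm(prompt: str) -> str:
    """Hypothetical stand-in for the code LLM; returns a fixed program."""
    return (
        "def execute_command(image):\n"
        "    cats = find(image, 'cat')\n"
        "    return 'yes' if len(cats) > 0 else 'no'\n"
    )


def find(image: dict, object_name: str) -> list:
    """Stub perception call: looks up stored boxes instead of running a detector."""
    return image.get(object_name, [])


def answer_query(image: dict, query: str, api_spec: str) -> str:
    # Query + API spec go to the code LLM, which returns program text.
    prompt = f"{api_spec}\n# Query: {query}\n"
    program_text = fake_code_llm(prompt)
    # The interpreter defines the generated function; the namespace exposes the API.
    namespace = {"find": find}
    exec(program_text, namespace)
    # Executing the program calls the perception stub and aggregates its results.
    return namespace["execute_command"](image)


API_SPEC = "def find(image, object_name): ...  # returns a list of detections"
image = {"cat": [(10, 20, 50, 60)]}
result = answer_query(image, "Is there a cat?", API_SPEC)
print(result)  # yes
```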

System Modules

Program Generator

Synthesize Python code from natural language query based on API definition

Model or implementation: Codex (GPT-3 variant fine-tuned on code)

find / exists (Perception API)

Locate objects or check existence

Model or implementation: GLIP

simple_query (Perception API)

Answer basic visual questions about a specific patch

Model or implementation: BLIP-2

compute_depth (Perception API)

Estimate median depth of a patch

Model or implementation: MiDaS

llm_query

Answer general knowledge questions or perform text reasoning

Model or implementation: GPT-3
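As one illustration of how a module listed above might be wrapped, here is a minimal sketch of a `compute_depth`-style function that returns the median depth of a patch. The depth map is a hypothetical grid of hand-written values standing in for a depth estimator's output, and the box convention `(left, top, right, bottom)` with exclusive right and bottom edges is an assumption, not the paper's specification.

```python
from statistics import median


def compute_depth(depth_map: list, box: tuple) -> float:
    """Median depth inside box = (left, top, right, bottom), right/bottom exclusive.

    depth_map is a hypothetical per-pixel grid (rows of floats) standing in
    for the output of a depth estimation model.
    """
    left, top, right, bottom = box
    values = [d for row in depth_map[top:bottom] for d in row[left:right]]
    if not values:
        raise ValueError("box does not overlap the depth map")
    return float(median(values))


depth_map = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 2.0, 5.0, 6.0],
    [1.0, 1.0, 7.0, 5.0],
]
near = compute_depth(depth_map, (0, 0, 2, 3))  # left half of the grid
far = compute_depth(depth_map, (2, 0, 4, 3))   # right half of the grid
print(near, far)  # 1.0 5.0
```

Using the median rather than the mean keeps a few outlier pixels (such as the 7.0 above) from shifting the patch-level estimate.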

Novel Architectural Elements

Replacement of learned neural controller with off-the-shelf Code LLM via API prompting
Use of native Python interpreter for logical execution (loops, conditionals, math) rather than differentiable neural logic

Modeling

Base Model: Codex (for code generation)

Compute: Not reported in the paper (Inference-only framework using pre-trained APIs)

Comparison to Prior Work

vs. NMN: ViperGPT requires no joint training and uses Python logic instead of neural logic
vs. VisProg: ViperGPT generates executable Python code (allowing native libraries/math) rather than pseudocode that needs a custom interpreter
vs. Flamingo: ViperGPT is compositional and interpretable, outperforming Flamingo on knowledge VQA without seeing task data
vs. HuggingGPT (Jarvis) [not cited in paper]: HuggingGPT uses an LLM to schedule Hugging Face models via JSON planning; ViperGPT generates direct Python logic with flow control (if/else, loops) for fine-grained reasoning

Limitations

Dependency on the performance of underlying API models (e.g., if GLIP fails to detect, the program fails)
Requires careful API design and prompt engineering (docstrings) for the LLM to understand capabilities
Inference speed limited by sequential execution of multiple large neural network calls
Codex is a closed-source commercial model
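To make the point about docstrings concrete, the sketch below assembles a prompt from function signatures and docstrings, which is the only description of the tools the LLM sees. The function bodies, the prompt layout, and the helper `build_api_prompt` are hypothetical; the paper's actual prompt format may differ.

```python
import inspect


def find(image, object_name: str) -> list:
    """Return patches of the image that contain `object_name`."""
    raise NotImplementedError  # stub: only the signature and docstring matter here


def simple_query(image, question: str) -> str:
    """Answer a basic visual question about the image."""
    raise NotImplementedError  # stub: only the signature and docstring matter here


def build_api_prompt(functions, query: str) -> str:
    """Concatenate signatures and docstrings, then open the function to complete."""
    parts = []
    for fn in functions:
        signature = f"def {fn.__name__}{inspect.signature(fn)}:"
        doc = inspect.getdoc(fn) or ""
        parts.append(f'{signature}\n    """{doc}"""')
    parts.append(f"# Query: {query}\ndef execute_command(image):")
    return "\n\n".join(parts)


prompt = build_api_prompt([find, simple_query], "How many muffins are there?")
print(prompt)
```

A vague or missing docstring in this setup leaves the LLM with nothing but the function name to go on, which is why API wording needs care.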

Reproducibility

Code: http://viper.cs.columbia.edu

The system relies on the OpenAI Codex API (closed source) and several open-source vision models (GLIP, BLIP-2, MiDaS). Prompt templates are provided in the paper's appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on visual grounding, VQA, and video reasoning tasks

Benchmarks:

RefCOCO / RefCOCO+ (Visual Grounding)
GQA (Compositional Image Question Answering)
OK-VQA (External Knowledge VQA)
NExT-QA (Video Causal/Temporal Reasoning)

Metrics:

Accuracy (Rec@1 for grounding)
Top-1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot performance on Visual Grounding tasks compared to other zero-shot and supervised baselines.
RefCOCO (testA)	Accuracy	55.0	72.0	+17.0
RefCOCO+ (testA)	Accuracy	52.2	67.0	+14.8
Performance on Question Answering tasks (Compositional and Knowledge-based).
GQA (test-dev)	Accuracy	44.7	48.1	+3.4
OK-VQA	Accuracy	50.6	51.9	+1.3
Performance on Video Reasoning tasks involving temporal and causal logic.
NExT-QA (Hard Split - Temporal)	Accuracy	45.3	49.8	+4.5
NExT-QA (Full Set)	Accuracy	56.9	60.0	+3.1

Main Takeaways

Python logic enables superior performance on tasks requiring spatial relations or temporal ordering (RefCOCO, NExT-QA) compared to end-to-end neural models
Explicitly separating perception (GLIP/BLIP) and reasoning (Codex/Python) improves interpretability and allows 'debugging' intermediate steps
The framework allows zero-shot generalization to video simply by iterating over frames in Python code, without needing a dedicated video model
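The frame-iteration idea in the last takeaway can be sketched as follows. The video is a hypothetical list of per-frame label sets and `exists` is a stub in place of a real detector call; the point is only that temporal ordering reduces to an ordinary Python loop and an index comparison.

```python
def exists(frame: dict, object_name: str) -> bool:
    """Stub perception call: checks a hypothetical per-frame label set."""
    return object_name in frame["objects"]


def first_frame_with(frames: list, object_name: str):
    """Return the index of the first frame containing the object, or None."""
    for index, frame in enumerate(frames):
        if exists(frame, object_name):
            return index
    return None


def appears_before(frames: list, first: str, second: str) -> bool:
    """True if `first` shows up in an earlier frame than `second`."""
    i = first_frame_with(frames, first)
    j = first_frame_with(frames, second)
    return i is not None and j is not None and i < j


video = [
    {"objects": {"child"}},
    {"objects": {"child", "ball"}},
    {"objects": {"child", "ball", "dog"}},
]
print(appears_before(video, "ball", "dog"))  # True
```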

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and code generation (e.g., Codex)
Basic understanding of Computer Vision tasks (Detection, VQA, Depth Estimation)
Concept of Neuro-symbolic AI (combining neural networks with logic/code)

Key Terms

Codex: A large language model fine-tuned on code, capable of translating natural language instructions into executable programming code

API: Application Programming Interface—here, a set of defined Python functions (like find() or compute_depth()) that the LLM calls to use vision tools

Zero-shot: The ability of a model to perform a task without having explicitly trained on examples of that specific task

GLIP: Grounded Language-Image Pre-training—a model used here for detecting objects specified by text (e.g., finding 'muffins')

BLIP-2: A vision-language model used here for answering simple visual questions about image patches

MiDaS: A model used for estimating depth (distance from camera) for every pixel in an image

IoU: Intersection over Union, the area of overlap between a predicted and a ground-truth bounding box divided by the area of their union (1.0 for a perfect match, 0 for no overlap)

Visual Grounding: The task of locating the specific region or bounding box in an image that corresponds to a textual description
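For the IoU term above, a short worked sketch may help. Boxes are assumed to be `(left, top, right, bottom)` tuples in pixel coordinates; the 0.5 threshold in the usage lines is a common convention for counting a grounding prediction as correct and is assumed here rather than taken from the paper.

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection over Union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    # Degenerate (zero-area) boxes have no meaningful overlap.
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
score = iou(predicted, ground_truth)  # overlap 1, union 4 + 4 - 1 = 7
is_correct = score >= 0.5             # assumed threshold
print(round(score, 3), is_correct)    # 0.143 False
```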