Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

📝 Paper Summary

Visual Instruction Tuning, Neuro-Symbolic Reasoning, Vision-Language Models

VPD improves Vision-Language Models by distilling the reasoning traces of verified, LLM-generated visual programs into the model, enabling complex reasoning in a single forward pass without external tools.

Core Problem

Existing methods for complex visual tasks either rely on slow, error-prone execution of explicit programs that call multiple models, or use standard instruction tuning, which fails to capture fine-grained reasoning steps.

Why it matters:

Program-based approaches (like VisProg) suffer from high latency and computational cost due to loading multiple specialized models.
Generated programs often contain errors or omit steps, and cannot recover when specialized tools fail.
Standard VLMs struggle with compositional skills like counting, spatial reasoning, and using external knowledge because image captions lack fine-grained reasoning details.

Concrete Example: For the question 'Who invented the musical instrument on the right?', a standard VLM might guess from superficial features. A program-based approach might identify the object correctly but still fail if the object detector errs or an API call times out. VPD distills the successful program trace (detect -> crop -> classify -> lookup) into the VLM's weights, so no tool calls are needed at inference time.
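The detect -> crop -> classify -> lookup trace in the example can be pictured as a small program that logs every tool call. The following is a minimal sketch, not the paper's code: the tool functions, their hardcoded return values, and the `Trace` helper are all hypothetical stand-ins for real vision models and a knowledge lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Records (tool, result) pairs so a run can later be rewritten as a rationale."""
    steps: list = field(default_factory=list)

    def log(self, tool, result):
        self.steps.append((tool, result))
        return result


# Hypothetical tool stubs with toy outputs; real tools would run vision models.
def detect(image, label):
    return [(10, 40, 120, 200), (300, 30, 420, 210)]  # boxes as (x0, y0, x1, y1)


def crop(image, box):
    return {"source": image, "box": box}


def classify(patch):
    return "saxophone"


def lookup(question):
    return "Adolphe Sax"


def visual_program(image, trace):
    boxes = trace.log("detect", detect(image, "musical instrument"))
    rightmost = max(boxes, key=lambda box: box[0])  # largest left edge = object on the right
    patch = trace.log("crop", crop(image, rightmost))
    name = trace.log("classify", classify(patch))
    return trace.log("lookup", lookup(f"Who invented the {name}?"))


trace = Trace()
answer = visual_program("image.jpg", trace)
```

The logged `trace.steps` is the kind of record that would later be rewritten into a natural language rationale.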

Key Novelty

Visual Program Distillation (VPD)

Leverages LLMs to generate multiple candidate Python programs for visual tasks, executes them with tools, and filters for those that produce the correct answer.
Translates the execution traces of correct programs into natural language Chain-of-Thought (CoT) rationales.
Distills these rationales into a single VLM, teaching it to mimic the tool-use reasoning process internally without needing the actual tools at inference time.

Architecture

The VPD framework pipeline, showing the transition from program generation to distillation.

Evaluation Highlights

PaLI-X-VPD (55B) achieves state-of-the-art results on 8 classical VQA tasks and 2 zero-shot benchmarks, outperforming the underlying PaLI-X base model.
Outperforms proprietary GPT-4V on complex reasoning tasks involving counting and spatial relations.
Improves factuality and consistency in human evaluations compared to standard instruction-tuned counterparts.

Breakthrough Assessment

8/10

Significantly advances VLM reasoning by effectively bridging neuro-symbolic program execution and end-to-end dense model training. Eliminates the runtime cost of tool-use while retaining reasoning benefits.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering and Reasoning via Instruction Tuning

Inputs: Image i and natural language query q

Outputs: Answer y and optionally a Chain-of-Thought rationale c

Pipeline Flow

Program Generation (LLM creates candidates)
Program Execution (Tools execute candidates)
Filtering & Translation (Select correct trace -> Convert to CoT)
Distillation (Train VLM on Image + Query -> CoT + Answer)
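The four stages above can be sketched as one data-synthesis function. This is an illustrative outline under stated assumptions: `generate`, `execute`, and `translate` are hypothetical callables standing in for the LLM, the tool runtime, and the trace rewriter, and the `"<rationale> Answer: <answer>"` target format is invented for the example.

```python
def synthesize_example(image, query, label, generate, execute, translate, k=5):
    """Return one training example, or None if no candidate program is correct."""
    for program in generate(query, k):               # 1. program generation
        try:
            answer, trace = execute(program, image)  # 2. program execution
        except Exception:
            continue                                 # a failed tool call discards the candidate
        if answer == label:                          # 3. filter against the label...
            rationale = translate(trace)             #    ...and translate the trace to CoT
            return {"image": image, "query": query,  # 4. target for distillation
                    "target": f"{rationale} Answer: {answer}"}
    return None


# Toy stand-ins: candidate "programs" are plain callables; only the third is correct.
def toy_generate(query, k):
    candidates = [
        lambda img: 1 / 0,                   # crashes, like a failing tool
        lambda img: ("3", ["count -> 3"]),   # runs but gives the wrong answer
        lambda img: ("2", ["count -> 2"]),   # matches the label
    ]
    return candidates[:k]


def toy_execute(program, image):
    return program(image)


def toy_translate(trace):
    return "The image contains " + trace[-1].split("-> ")[1] + " dogs."


example = synthesize_example("img.jpg", "How many dogs?", "2",
                             toy_generate, toy_execute, toy_translate)
```

When every candidate fails or disagrees with the label, the function returns `None` and the query contributes no rationale data.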

System Modules

Program Generator (Data Synthesis, Offline)

Generate candidate Python programs based on the query

Model or implementation: PaLM-2

Execution Engine (Data Synthesis, Offline)

Run the programs using specialized vision tools

Model or implementation: Hybrid (PaLI-X, OWLv2, Google Depth API, PaLM-2)

Trace Translator (Data Synthesis, Offline)

Convert code execution traces into natural language reasoning

Model or implementation: PaLM-2

Student VLM

Single end-to-end model that predicts answer and rationale

Model or implementation: PaLI-X (55B) or PaLI-3 (5B)

Novel Architectural Elements

Distillation pipeline that converts executable program traces into natural language CoT for end-to-end VLM training

Modeling

Base Model: PaLI-X (55B) and PaLI-3 (5B)

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Minimize prediction error for both the answer and the rationale.

Formally: L = L_cross_entropy(y_hat, y) + lambda * L_rationale(c_hat, c), where lambda weights the rationale term.
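As a rough numerical illustration of the combined objective, the sketch below treats both terms as token-level cross-entropy. That choice for the rationale term, the function names, and the default weight `lam=1.0` are assumptions; the summary does not state them.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood, given the model's probability for each reference token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def combined_loss(answer_probs, rationale_probs, lam=1.0):
    # Answer term plus a weighted rationale term (lam is an assumed value).
    return cross_entropy(answer_probs) + lam * cross_entropy(rationale_probs)


# Probabilities the model assigns to the reference answer and rationale tokens (toy values).
loss = combined_loss(answer_probs=[0.9, 0.8], rationale_probs=[0.7, 0.6, 0.9], lam=1.0)
```

A model that assigns probability 1 to every reference token would reach a loss of zero under this formulation.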

Training Data:

Program Sampling: k=5 candidates per query
Filtering: Keep program if execution result matches ground truth label
CoT Generation: 20 hand-crafted few-shot examples used to prompt PaLM-2 to rewrite traces
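The CoT generation step relies on few-shot prompting. A minimal sketch of how such a prompt might be assembled is shown below; the instruction wording, the `Trace:` / `Rationale:` layout, and the function name are hypothetical, since the actual prompt is not given here.

```python
def build_translation_prompt(few_shot, trace):
    """Assemble a few-shot prompt that asks an LLM to rewrite a trace as a rationale."""
    parts = ["Rewrite each program execution trace as a step-by-step rationale.\n"]
    for example in few_shot:
        parts.append(f"Trace:\n{example['trace']}\nRationale: {example['rationale']}\n")
    # The final block is left open so the LLM completes the rationale.
    parts.append(f"Trace:\n{trace}\nRationale:")
    return "\n".join(parts)


few_shot = [
    {"trace": "detect('dog') -> 2 boxes", "rationale": "There are two dogs in the image."},
]
prompt = build_translation_prompt(few_shot, "detect('cat') -> 1 box")
```

With 20 hand-crafted examples, `few_shot` would simply hold 20 such trace and rationale pairs.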

Key Hyperparameters:

program_sampling_temperature: 0.5
program_sampling_k: 5
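To give intuition for the temperature setting, the sketch below applies the standard temperature-scaled softmax to toy logits. That the program generator applies temperature in exactly this way is an assumption based on common LLM sampling practice; the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharper = softmax_with_temperature(logits, 0.5)  # the temperature listed above
```

A temperature below 1 concentrates probability on the highest-scoring tokens while still leaving room for the k=5 samples to differ.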

Compute: Not reported in the paper

Comparison to Prior Work

vs. VisProg/ViperGPT: VPD moves the program execution to the data generation phase, resulting in a single fast VLM at inference time rather than a slow multi-model pipeline.
vs. LLaVA/InstructBLIP: VPD uses execution traces verified by ground truth to create dense, accurate reasoning chains, whereas others rely on potentially hallucinated or less detailed LLM-generated captions.

Limitations

Ambiguity in natural language answers can make verifying program correctness difficult (handled via LLM judging).
Depends on the quality of the tools available during the data synthesis phase; if tools fail consistently, no training data is generated.
Requires ground truth labels for the most effective filtering of candidate programs.

Reproducibility

Code: https://github.com/Yushi-Hu/Visual-Program-Distillation

The paper uses PaLI-X and PaLM-2, which are proprietary Google models, but the method is model-agnostic. Exact training compute resources are not specified.

📊 Experiments & Results

Evaluation Setup

Fine-tuning VLMs on VPD-generated data and evaluating on VQA and reasoning benchmarks.

Benchmarks:

OK-VQA (Knowledge-based VQA)
A-OKVQA (Knowledge-based VQA)
GQA (Compositional Reasoning)
TallyQA (Counting)
Hateful Memes (Content Moderation / Reasoning)
MMBench (Multimodal Benchmark)
POPE (Hallucination Evaluation)

Metrics:

Accuracy (VQA accuracy)
F1 Score (for POPE)
ROC AUC (for Hateful Memes)
Statistical methodology: Not explicitly reported in the paper
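For reference, the three metric families listed above can be computed as in the following sketch. The soft VQA accuracy shown is the common min(matches / 3, 1) formulation; whether every benchmark here uses exactly that variant is an assumption, and the function names are illustrative.

```python
def vqa_accuracy(prediction, human_answers):
    """Soft VQA accuracy: full credit once three annotators gave the predicted answer."""
    return min(1.0, human_answers.count(prediction) / 3)


def f1_score(y_true, y_pred):
    """Binary F1 from true positives, false positives, and false negatives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


acc = vqa_accuracy("cat", ["cat", "cat", "dog"])
f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise ROC AUC is quadratic in the number of examples, which is fine for illustration but slow for large evaluation sets.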

Key Results

VPD achieves state-of-the-art results on multiple VQA and reasoning benchmarks compared to the base PaLI-X model and other leading VLMs.
Benchmark	Metric	Baseline	This Paper	Δ
OK-VQA	Accuracy	66.1	68.8	+2.7
A-OKVQA (val)	Accuracy	64.5	65.6	+1.1
TallyQA (Simple)	Accuracy	83.6	88.6	+5.0
TallyQA (Complex)	Accuracy	68.6	73.9	+5.3
Hateful Memes	ROC AUC	84.9	87.1	+2.2
MMBench	Accuracy	79.0	81.3	+2.3
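The Δ column can be recomputed from the Baseline and This Paper columns as a quick consistency check. The numbers below are copied from the table; the variable names are illustrative.

```python
rows = [
    ("OK-VQA", 66.1, 68.8),
    ("A-OKVQA (val)", 64.5, 65.6),
    ("TallyQA (Simple)", 83.6, 88.6),
    ("TallyQA (Complex)", 68.6, 73.9),
    ("Hateful Memes", 84.9, 87.1),
    ("MMBench", 79.0, 81.3),
]

# Round to one decimal to avoid floating-point noise in the differences.
deltas = {name: round(ours - baseline, 1) for name, baseline, ours in rows}
largest_gain = max(deltas, key=deltas.get)
```

The two TallyQA rows show the largest gains, in line with the counting improvements reported in the table.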

Main Takeaways

VPD consistently improves performance on tasks requiring compositional reasoning (GQA, TallyQA) and external knowledge (OK-VQA).
The method improves both the large (55B) and the smaller (5B) model, suggesting it is not tied to a single model scale.
Sampling multiple programs and filtering by execution correctness is crucial; using only the top-1 program yields significantly lower performance.
Human evaluation confirms that VPD-generated rationales are more factual and consistent with the answer compared to standard instruction tuning.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Instruction Tuning
Chain-of-Thought (CoT) Prompting
Python programming concepts (for understanding visual programs)

Key Terms

VPD: Visual Program Distillation—the proposed method of training VLMs on traces of verified visual programs.

PaLI-X: A large-scale multilingual Vision-Language Model used as the backbone for experiments.

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.

Visual Program: An executable script (often Python) generated by an LLM that calls computer vision tools (like object detectors) to solve a task.

Instruction Tuning: Fine-tuning a pre-trained model on datasets formatted as instructions (input) and desired responses (output) so that it follows user instructions more reliably.

Distillation: The process of transferring knowledge from a large or complex teacher system (here, the program execution pipeline) to a single student model.

OWLv2: Open-vocabulary object detection model used as a tool in the program execution phase.