
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, K. Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman
Google Research, University of Washington
Computer Vision and Pattern Recognition (2023)
MM Agent Reasoning Factuality

📝 Paper Summary

Visual Instruction Tuning, Neuro-Symbolic Reasoning, Vision-Language Models
Visual Program Distillation (VPD) improves vision-language models (VLMs) by distilling the reasoning traces of LLM-generated visual programs, verified against known answers, into the model, so that it can perform complex reasoning in a single forward pass without external tools.
Core Problem
Existing methods for complex visual tasks either execute explicit programs that call multiple models, which is slow and error-prone, or rely on standard instruction tuning, which fails to capture fine-grained reasoning steps.
Why it matters:
  • Program-based approaches (like VisProg) suffer from high latency and computational cost due to loading multiple specialized models.
  • Generated programs often contain errors or omit steps, and cannot recover when specialized tools fail.
  • Standard VLMs struggle with compositional skills like counting, spatial reasoning, and using external knowledge because the image captions used for training lack fine-grained reasoning details.
Concrete Example: For the question 'Who invented the musical instrument on the right?', a standard VLM might guess based on superficial features. A program-based approach might correctly identify the object but fail if the object detector returns a wrong result or an API call times out. VPD distills the successful program trace (detect -> crop -> classify -> lookup) into the VLM's weights.
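The detect -> crop -> classify -> lookup trace in this example can be sketched as a small traced program. This is only an illustration: the four tool functions are hypothetical stubs with hard-coded outputs (no real detector, classifier, or knowledge base is called), and the `Trace` helper and `answer_question` name are assumptions, not the paper's code.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Records (tool name, result) for every tool call, in order."""
    steps: list = field(default_factory=list)

    def call(self, name, fn, *args):
        result = fn(*args)
        self.steps.append((name, result))
        return result


# Hypothetical tool stubs with hard-coded outputs, for illustration only.
def detect(image, query):
    # Boxes are (x_min, y_min, x_max, y_max) in pixels.
    return [(10, 50, 150, 210), (300, 40, 480, 200)]


def crop(image, box):
    return f"{image}@{box}"


def classify(patch):
    return "saxophone"


def lookup(query):
    return "Adolphe Sax"


def answer_question(image, trace):
    """'Who invented the musical instrument on the right?'"""
    boxes = trace.call("detect", detect, image, "musical instrument")
    rightmost = max(boxes, key=lambda box: box[0])  # largest x_min
    patch = trace.call("crop", crop, image, rightmost)
    instrument = trace.call("classify", classify, patch)
    return trace.call("lookup", lookup, f"Who invented the {instrument}?")


if __name__ == "__main__":
    trace = Trace()
    print(answer_question("photo.jpg", trace))
    for name, result in trace.steps:
        print(name, "->", result)
```

Any step here can fail at run time in a real tool pipeline; the recorded `steps` list is the kind of trace that VPD turns into training signal so the VLM no longer needs the tools.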
Key Novelty
Visual Program Distillation (VPD)
  • Leverages LLMs to generate multiple candidate Python programs for visual tasks, executes them with tools, and filters for those that produce the correct answer.
  • Translates the execution traces of correct programs into natural language Chain-of-Thought (CoT) rationales.
  • Distills these rationales into a single VLM, teaching it to mimic the tool-use reasoning process internally without needing the actual tools at inference time.
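The three steps above can be sketched as a sample-and-filter loop. This is a minimal sketch under stated assumptions: candidate programs are plain Python callables standing in for LLM-generated code, the answer check is a simple case-insensitive string match, and a fixed template stands in for the step that rewrites a trace as a rationale. All function names are hypothetical.

```python
def run_candidate(program, image):
    """Execute one candidate; return (answer, trace), or (None, trace) on failure."""
    trace = []
    try:
        answer = program(image, trace)
    except Exception:
        return None, trace  # a failing tool call disqualifies the candidate
    return answer, trace


def select_verified(candidates, image, label):
    """Return (answer, trace) of the first candidate whose answer matches the label."""
    for program in candidates:
        answer, trace = run_candidate(program, image)
        if answer is not None and str(answer).strip().lower() == label.strip().lower():
            return answer, trace
    return None


def trace_to_rationale(trace, answer):
    """Template-based stand-in for rewriting a trace as a chain-of-thought rationale."""
    steps = [f"Step {i}: {step}." for i, step in enumerate(trace, start=1)]
    return " ".join(steps + [f"Answer: {answer}"])


def build_training_example(question, image, label, candidates):
    """Produce one distillation example, or None if no candidate is verified."""
    verified = select_verified(candidates, image, label)
    if verified is None:
        return None
    answer, trace = verified
    return {"input": question, "target": trace_to_rationale(trace, answer)}


# Three hypothetical candidates for the question "How many dogs are there?"
def candidate_tool_failure(image, trace):
    raise RuntimeError("detector unavailable")


def candidate_wrong(image, trace):
    trace.append("detect('cat') found 3 boxes")
    return 3


def candidate_correct(image, trace):
    trace.append("detect('dog') found 2 boxes")
    trace.append("count(boxes) returned 2")
    return 2


CANDIDATES = [candidate_tool_failure, candidate_wrong, candidate_correct]

if __name__ == "__main__":
    print(build_training_example("How many dogs are there?", "photo.jpg", "2", CANDIDATES))
```

Only the `input`/`target` pairs that survive filtering would be used to fine-tune the VLM, so at inference time the model produces the rationale and the answer itself, with no program execution.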
Architecture
Figure 2: The VPD framework pipeline, showing the transition from program generation to distillation.
Evaluation Highlights
  • PaLI-X-VPD (55B) achieves state-of-the-art results on 8 classical VQA tasks and 2 zero-shot benchmarks, outperforming the underlying PaLI-X base model.
  • Outperforms proprietary GPT-4V on complex reasoning tasks involving counting and spatial relations.
  • Improves factuality and consistency in human evaluations compared to standard instruction-tuned counterparts.
Breakthrough Assessment
8/10
Significantly advances VLM reasoning by bridging neuro-symbolic program execution and end-to-end training of a single dense model. Eliminates the runtime cost of tool use while retaining its reasoning benefits.