Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

📝 Paper Summary

Multimodal Language Models, Agentic AI, Visual Reasoning

VisualSketchpad provides multimodal LMs with a framework to generate intermediate visual sketches (lines, boxes, masks) using Python tools to facilitate reasoning on complex vision and math tasks.

Core Problem

Current multimodal LMs rely on text-only intermediate reasoning (Chain-of-Thought), lacking the ability to draw visual artifacts (like auxiliary lines or bounding boxes) which are crucial for solving spatial and geometric problems.

Why it matters:

Humans rely on sketching for problem-solving (e.g., auxiliary lines in geometry, marking maps), but LMs currently cannot replicate this visual-spatial reasoning process.
Existing multimodal benchmarks (Geometry3K, BLINK) require symbolic grounding and spatial understanding that are difficult to express through text alone.
Without intermediate visual steps, models struggle with tasks like proving geometric theorems or counting overlapping objects.

Concrete Example: In a geometry problem that asks for an angle, a standard LM tries to solve it analytically and fails. With VisualSketchpad, the model writes Python code to draw an auxiliary parallel line on the image, then uses the newly visible angle relationships to solve the problem correctly.
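To make the auxiliary-line idea concrete, here is a minimal sketch of the kind of code a model might emit. The helper names (`parallel_through`, `draw_auxiliary_line`), the coordinates, and the output path are hypothetical and are not taken from the paper; only the use of Matplotlib as a sketching library comes from the summary.

```python
import math


def parallel_through(p, a, b, half_length=1.0):
    """Return the endpoints of a segment through p that is parallel to line ab.

    Hypothetical helper for illustration; not part of the paper's toolset.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("a and b must be distinct points")
    ux, uy = dx / norm, dy / norm  # unit direction vector of line ab
    start = (p[0] - half_length * ux, p[1] - half_length * uy)
    end = (p[0] + half_length * ux, p[1] + half_length * uy)
    return start, end


def draw_auxiliary_line(a, b, p, path="sketch.png"):
    """Draw segment ab plus a dashed parallel through p, then save the figure."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    start, end = parallel_through(p, a, b, half_length=2.0)
    fig, ax = plt.subplots()
    ax.plot([a[0], b[0]], [a[1], b[1]], color="black")  # original segment
    ax.plot([start[0], end[0]], [start[1], end[1]],
            color="red", linestyle="--")  # auxiliary parallel line
    ax.scatter([p[0]], [p[1]], color="blue")  # the point the line passes through
    ax.set_aspect("equal")
    fig.savefig(path)
    plt.close(fig)
    return path


start, end = parallel_through(p=(0, 1), a=(0, 0), b=(2, 0), half_length=1.0)
```

Calling `draw_auxiliary_line((0, 0), (2, 0), (0, 1))` would save an image that can be handed back to the model as the next observation.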

Key Novelty

VisualSketchpad (Visual Chain-of-Thought)

Enables LMs to 'think' visually by generating Python code to modify input images or create new plots (e.g., drawing lines, segmenting objects) before answering.
Generalizes tool use by integrating specialist vision models (detection, segmentation) and plotting libraries (Matplotlib, NetworkX) as sketching tools.
Operates as an iterative agent: Thought (plan) → Action (generate sketch code) → Observation (view updated image) → Final Answer.

Architecture

The iterative Thought-Action-Observation loop of VisualSketchpad.

Evaluation Highlights

+12.7% average accuracy gain on math tasks (Geometry, Functions, Graphs, Chess) for GPT-4o compared to baseline.
+8.6% average accuracy gain on vision tasks (BLINK, V*Bench) for GPT-4o compared to baseline.
Sets new state-of-the-art on V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%) using GPT-4o.

Breakthrough Assessment

8/10

Significantly extends Chain-of-Thought into the visual modality without model training, yielding large gains on hard benchmarks. Bridges the gap between LMs and visual tools effectively.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning where an agent interacts with an environment to answer a query q involving text and images.

Inputs: Multimodal query q containing visual and textual components.

Outputs: Final text answer after T steps of reasoning (thoughts, actions, and observations).

Pipeline Flow

Thought: Model plans the next sketching step based on context
Action: Model generates Python code to invoke vision tools or plotting libraries
Observation: Environment executes code, updates the image (sketch), and returns it to the model
Repeat until Terminate action
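The loop above can be sketched in a few lines of Python. This is an illustrative skeleton under stated assumptions, not the authors' implementation: `run_sketchpad`, `Step`, the `TERMINATE:` string convention, and the scripted stand-ins for the model and the execution environment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    thought: str      # the model's plan for this step
    action: str       # code the model asked to run
    observation: str  # stands in for the updated sketch image


Model = Callable[[str, List[Step]], Tuple[str, str]]


def run_sketchpad(query: str, model: Model, execute: Callable[[str], str],
                  max_steps: int = 5) -> Tuple[Optional[str], List[Step]]:
    """Alternate thought/action/observation until the model terminates."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action = model(query, history)
        if action.startswith("TERMINATE"):
            # Assumed convention: the final answer follows the keyword.
            answer = action[len("TERMINATE"):].strip(": ").strip()
            return answer, history
        history.append(Step(thought, action, execute(action)))
    return None, history  # step budget exhausted without an answer


def scripted_model(query: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the multimodal LM: sketch once, then answer."""
    if not history:
        return "I should draw an auxiliary line first.", "draw_line()"
    return "The sketch shows the answer.", "TERMINATE: 42"


def fake_execute(code: str) -> str:
    """Stand-in for the environment that runs code and returns a sketch."""
    return f"<image after running {code}>"


answer, steps = run_sketchpad("What is the angle?", scripted_model, fake_execute)
```

In a real system the model callable would wrap an API request and the observation would be an image rather than a string; the control flow is the part this sketch is meant to show.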

System Modules

Planner/Reasoning Agent

Generates thoughts and Python code actions based on multimodal context

Model or implementation: GPT-4o or GPT-4-Turbo (via API)

Sketching Tools (Math) (Action Execution)

Executes code to draw mathematical diagrams

Model or implementation: Python Libraries (Matplotlib, NetworkX, Chess)
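The summary does not describe how generated code is actually run, so the following is only a rough sketch of one way an action-execution step could work: run the code string in a shared namespace and return whatever it printed, or a short error trace, as the observation. `execute_action` is a hypothetical name, and `exec` provides no sandboxing, so untrusted code would need real isolation.

```python
import contextlib
import io
import traceback


def execute_action(code: str, namespace: dict) -> str:
    """Run model-generated code and return its printed output as text.

    Hypothetical helper: errors are returned instead of raised, so the
    model could see them in its next observation and revise its plan.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # NOTE: not a sandbox
    except Exception:
        buffer.write(traceback.format_exc(limit=1))
    return buffer.getvalue()


namespace = {}  # persists across steps so later actions can reuse variables
ok_output = execute_action("x = 2 + 3\nprint(x)", namespace)
error_output = execute_action("1 / 0", namespace)
```

Keeping one namespace across steps lets a later action build on variables defined by an earlier one, which suits an iterative loop.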

Sketching Tools (Vision) (Action Execution)

Executes vision specialists to annotate images

Model or implementation: Grounding-DINO, SAM, DepthAnything, Semantic-SAM

Novel Architectural Elements

Integration of visual artifact generation (via code and specialist models) directly into the Chain-of-Thought loop as 'Observations'
Unified framework treating both math plotting and computer vision annotations as 'sketching' actions

Modeling

Base Model: GPT-4o and GPT-4-Turbo (via API)

Compute: Inference only. Depends on API calls and local execution of vision tools (DINO, SAM, etc.).

Comparison to Prior Work

vs. VisProg/ViperGPT: Sketchpad is iterative (agents can change plans based on observations) rather than generating a single static program.
vs. SoM: Sketchpad lets the LM decide *when* and *what* to mark or sketch dynamically, rather than applying marks as a fixed pre-processing step.
vs. SEAL [not cited in paper]: SEAL focuses specifically on active visual search, while Sketchpad is a general-purpose visual reasoning framework.

Limitations

Relies on the capabilities of underlying specialist vision models; failure in detection leads to failure in reasoning.
Latency is higher due to iterative API calls and image generation steps.
Visual overlays can sometimes hurt performance when the added clutter confuses the VLM (observed in some baselines).

Reproducibility

Code: https://visualsketchpad.github.io/

📊 Experiments & Results

Evaluation Setup

Evaluation on diverse Math and Computer Vision benchmarks using GPT-4o and GPT-4-Turbo backbones.

Benchmarks:

Geometry3K (Geometry Problem Solving)
IsoBench (Math: Functions, Graphs, Chess)
V*Bench (Visual Search/Reasoning)
BLINK (Visual Perception: Depth, Spatial, Correspondence)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Math Tasks: Sketching significantly improves performance across Geometry, Functions, Graphs, and Chess.

Benchmark	Metric	Baseline	This Paper	Δ
IsoBench (Max Flow)	Accuracy (%)	25.0	66.3	+41.3
IsoBench (Function Convexity)	Accuracy (%)	66.0	88.0	+22.0
Geometry3K	Accuracy (%)	51.4	57.3	+5.9

Vision Tasks: Sketching with specialist models improves perception benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
V*Bench	Accuracy (%)	66.0	80.3	+14.3
BLINK (Relative Depth)	Accuracy (%)	57.0	69.1	+12.1
BLINK (Spatial Reasoning)	Accuracy (%)	78.4	83.9	+5.5
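The Δ column is the absolute difference, in percentage points, between the two accuracy columns. A short check that recomputes it from the reported numbers (the dictionary layout is just for illustration):

```python
# (baseline accuracy, accuracy with sketching), copied from the table above
results = {
    "IsoBench (Max Flow)": (25.0, 66.3),
    "IsoBench (Function Convexity)": (66.0, 88.0),
    "Geometry3K": (51.4, 57.3),
    "V*Bench": (66.0, 80.3),
    "BLINK (Relative Depth)": (57.0, 69.1),
    "BLINK (Spatial Reasoning)": (78.4, 83.9),
}

# Absolute gain in percentage points, rounded to one decimal like the table
deltas = {
    name: round(ours - base, 1)
    for name, (base, ours) in results.items()
}

for name, delta in deltas.items():
    print(f"{name}: +{delta}")
```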

Experiment Figures

Bar charts showing the frequency of different vision tools used by GPT-4o and GPT-4 Turbo across tasks.

Main Takeaways

Consistent improvement across all evaluated tasks (Math and Vision) over strong baselines like GPT-4o.
The method is particularly effective for tasks requiring spatial understanding (graphs, geometry, depth) or locating small objects.
Tool usage is task-dependent: V*Bench relies on detection/zoom, while Depth tasks rely on depth estimation models.
GPT-4o utilizes vision tools more frequently and effectively than GPT-4 Turbo, correlating with higher performance gains.
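As a rough illustration of the detect-then-zoom pattern mentioned for V*Bench, the sketch below computes a padded crop region around a detected bounding box, clamped to the image bounds. The function name, the `(left, top, right, bottom)` box convention, and the padding ratio are assumptions made for this example, not details from the paper.

```python
def padded_crop_box(box, image_size, pad_ratio=0.25):
    """Expand a (left, top, right, bottom) box by pad_ratio on each side.

    Hypothetical helper: the result is clamped so the crop never leaves
    an image of size (width, height).
    """
    left, top, right, bottom = box
    width, height = image_size
    pad_x = (right - left) * pad_ratio
    pad_y = (bottom - top) * pad_ratio
    return (
        max(0, int(left - pad_x)),
        max(0, int(top - pad_y)),
        min(width, int(right + pad_x)),
        min(height, int(bottom + pad_y)),
    )


# A small object near the centre of a 100x100 image
crop = padded_crop_box((40, 40, 60, 60), (100, 100), pad_ratio=0.5)
```

The resulting tuple could be passed to an image library's crop call so the model sees an enlarged view of the small object with some surrounding context.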

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) prompting
Python plotting libraries (Matplotlib, NetworkX)
Basic computer vision tasks (detection, segmentation)

Key Terms

Visual Chain-of-Thought: A reasoning process where intermediate steps involve generating and analyzing visual artifacts (sketches) rather than just text.

Auxiliary lines: Extra lines drawn on a geometry diagram to reveal relationships (e.g., parallel lines, triangles) needed to solve a proof.

SoM: Set-of-Mark, a visual prompting technique in which objects in an image are overlaid with numbered masks so that LMs can refer to them.

V*Bench: A benchmark for evaluating MLLMs on detailed visual grounding and reasoning tasks.

BLINK: A benchmark focusing on visual perception tasks that are easy for humans but hard for current MLLMs (e.g., spatial reasoning, depth).

Grounding-DINO: An open-set object detection model that finds objects based on text queries.

Segment Anything (SAM): A model capable of generating segmentation masks for any object in an image.