Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

📝 Paper Summary

Multimodal Reasoning, Visual Chain-of-Thought

The paper proposes a paradigm shift from 'Thinking about Images' (text-only reasoning on static images) to 'Thinking with Images' (using generated or manipulated visual content as active intermediate reasoning steps).

Core Problem

Current Large Multimodal Models (LMMs) treat images as static context encoded once, creating a semantic gap where fine-grained details and spatial relationships are lost during flattening into features.

Why it matters:

Models falter on tasks requiring iterative visual engagement, such as complex physical reasoning or precise spatial manipulation, because they cannot 'look again' or simulate outcomes
Text-centric Chain-of-Thought (CoT) relies on symbolic logic, which is brittle when it has to describe continuous visual or physical properties
Current approaches suffer from an information bottleneck where the initial one-time encoding is insufficient for answering complex queries

Concrete Example: To determine if a robot can grasp an apple without hitting a glass, a text-only model might hallucinate coordinates. A 'Thinking with Images' model generates a future frame; if the generated image shows a collision, this visual evidence forces the model to revise its plan.

Key Novelty

The 'Thinking with Images' Taxonomy

Defines a three-stage evolution of visual reasoning: (1) Tool-Driven Exploration (calling external APIs), (2) Programmatic Manipulation (generating code to visualize), and (3) Intrinsic Imagination (internal image generation).
Formalizes visual reasoning as a mixed-modal state sequence where intermediate steps can be visual artifacts (images, crops, plots) rather than just text tokens.

Breakthrough Assessment

9/10

This is a foundational survey defining a new sub-field. It provides a clear taxonomy and theoretical grounding for the emerging trend of visual generation-as-reasoning.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning where the reasoning history S_t includes both text tokens and visual artifacts.

Inputs: Query Q and Image I

Outputs: Answer A, derived via a sequence of intermediate steps z_t where z_t can be text or a new image/visual manipulation.
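The setting above can be written compactly as follows. This is a sketch in assumed notation rather than the paper's own equations: the symbols \mathcal{T} (text space), \mathcal{V} (visual artifact space), the policy \pi_\theta, and the final step index T are introduced here for illustration only.

```latex
\[
z_t \in \mathcal{T} \cup \mathcal{V}, \qquad
S_t = (Q, I, z_1, \dots, z_t), \qquad
z_{t+1} \sim \pi_\theta(\cdot \mid S_t), \qquad
A \sim \pi_\theta(\cdot \mid S_T).
\]
```

Read this way, a text-only chain is the special case in which every z_t lies in \mathcal{T}.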

Pipeline Flow

Stage 1: Tool-Driven Visual Exploration
Stage 2: Programmatic Visual Manipulation
Stage 3: Intrinsic Visual Imagination

System Modules

Stage 1: Tool-Driven Explorer (Evolutionary Stages)

Orchestrate external visual tools to gather data

Model or implementation: Varies (e.g., GPT-4 + API calls)

Stage 2: Visual Programmer (Evolutionary Stages)

Generate executable code to perform custom visual analysis

Model or implementation: Code-generating LMM

Stage 3: Intrinsic Visual Thinker (Evolutionary Stages)

Directly generate new images/simulations as mental steps

Model or implementation: Unified Generative LMM
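The difference between the three stages is mainly who produces the visual artifact. The toy sketch below illustrates that distinction on a tiny nested-list "image". Every name in it (`crop`, `TOOLS`, `stage1_tool_call`, `stage2_run_program`, `stage3_imagine`) is hypothetical, and the Stage 3 function is only a stand-in for a generative model, not an implementation of one.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """A fixed 'external tool': return a rectangular region of the image."""
    return [row[left:left + width] for row in image[top:top + height]]


# Stage 1: the model only picks a tool name and arguments from a fixed menu.
TOOLS: Dict[str, Callable[..., Image]] = {"crop": crop}


def stage1_tool_call(image: Image, tool: str, **kwargs: int) -> Image:
    return TOOLS[tool](image, **kwargs)


# Stage 2: the model writes a program; the runtime executes it.
def stage2_run_program(image: Image, program: str) -> Image:
    # Toy namespace with builtins removed. This is NOT a secure sandbox.
    namespace = {"__builtins__": {}, "image": image}
    exec(program, namespace)
    return namespace["result"]


# Stage 3: the model itself emits the new image; stubbed here by a function.
def stage3_imagine(image: Image) -> Image:
    """Stand-in for a generative model: 'imagine' the scene mirrored left-right."""
    return [list(reversed(row)) for row in image]


image = [[0, 10], [20, 30]]
zoomed = stage1_tool_call(image, "crop", top=0, left=1, height=2, width=1)
inverted = stage2_run_program(image, "result = [[255 - p for p in row] for row in image]")
imagined = stage3_imagine(image)
```

In Stage 1 the set of possible operations is fixed in advance, in Stage 2 it is as open as the generated program, and in Stage 3 no external executor is involved at all.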

Novel Architectural Elements

Formalization of the reasoning state history S_t to include z_t from the union of Text Space and Visual Artifact Space
Proposed feedback loop where generated visual content is fed back into the model as new context for self-correction
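A minimal sketch of these two elements, using the grasp-without-collision scenario from the summary's concrete example. It assumes a heavy simplification: the "generated image" is reduced to the box the gripper would sweep, and `imagine_frame` is a hypothetical lookup table standing in for a real image generator. None of the names come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class TextStep:
    content: str


@dataclass
class VisualStep:
    """Toy visual artifact: the region the gripper sweeps in an imagined frame."""
    gripper_path: Box


Step = Union[TextStep, VisualStep]


@dataclass
class ReasoningState:
    """Mixed-modal history: steps may be text or visual artifacts."""
    query: str
    steps: List[Step] = field(default_factory=list)


def overlaps(a: Box, b: Box) -> bool:
    """Axis-aligned boxes overlap if they intersect on both axes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def imagine_frame(approach: str) -> VisualStep:
    """Hypothetical stand-in for image generation: map a plan to a swept region."""
    paths = {"from_left": (0, 0, 6, 2), "from_above": (4, 0, 6, 5)}
    return VisualStep(paths[approach])


def solve(query: str, glass: Box, candidates: List[str]) -> ReasoningState:
    state = ReasoningState(query)
    for approach in candidates:
        state.steps.append(TextStep(f"plan: approach {approach}"))
        frame = imagine_frame(approach)
        state.steps.append(frame)  # the visual step is fed back as context
        if overlaps(frame.gripper_path, glass):
            state.steps.append(TextStep("imagined frame shows a collision; revise"))
            continue
        state.steps.append(TextStep(f"answer: {approach} is safe"))
        break
    return state


state = solve("Can the robot grasp the apple without hitting the glass?",
              glass=(1, 0, 3, 3), candidates=["from_left", "from_above"])
```

The point of the sketch is the control flow: the plan is revised because of what the intermediate visual step shows, not because of further text-only deduction.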

Comparison to Prior Work

vs. Standard CoT: Introduces visual artifacts (crops, plots, generated images) into the reasoning chain, whereas Standard CoT is text-only
vs. Tool-use Agents: Differentiates between simple tool use (Stage 1) and intrinsic imagination (Stage 3) where the model itself generates the visual thought
Contribution: Unifies disparate methods (tool use, code gen, image gen) under a single 'Thinking with Images' taxonomy

Limitations

Computational Cost: Each generated image adds far more tokens to the context than a text step does, so compute grows quickly with the number of visual steps ('token explosion').
Error Propagation: Visual hallucinations (e.g., generating a wrong auxiliary line) become part of the context and can corrupt all subsequent reasoning that relies on them.
Architectural Disconnect: Current modular designs (separate vision/language models) hinder the tight feedback loop needed for fast iterative visual thinking.
Strategy Selection: Models struggle to autonomously decide which visual strategy (zoom vs. simulate vs. code) is best for a given problem.
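To give a rough sense of scale for the computational cost item, the sketch below counts context tokens for a reasoning chain under one common assumption: images are tokenized as a ViT-style grid of fixed-size patches. The 14-pixel patch size, the 336x336 resolution, and the 40 tokens per text step are illustrative choices, not figures from the survey, and real models tokenize images in different ways.

```python
from typing import Tuple


def patch_tokens(height: int, width: int, patch: int = 14) -> int:
    """Tokens for one image under an assumed ViT-style patch grid."""
    return (height // patch) * (width // patch)


def chain_tokens(text_steps: int, tokens_per_text_step: int, visual_steps: int,
                 image_hw: Tuple[int, int] = (336, 336), patch: int = 14) -> int:
    """Total context tokens for a chain mixing text steps and image steps."""
    per_image = patch_tokens(image_hw[0], image_hw[1], patch)
    return text_steps * tokens_per_text_step + visual_steps * per_image


per_image = patch_tokens(336, 336)       # 24 * 24 = 576 tokens per image
text_only = chain_tokens(5, 40, 0)       # 200 tokens
with_images = chain_tokens(5, 40, 3)     # 200 + 3 * 576 = 1928 tokens
```

Under these assumptions, adding three intermediate images to a five-step chain raises the context from 200 to 1928 tokens, which is the kind of growth the 'token explosion' label refers to.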

Reproducibility

Code: https://github.com/zhaochen0110/Awesome_Think_With_Images

This is a survey paper. The GitHub repository linked above tracks the papers and methods discussed.

📊 Experiments & Results

Evaluation Setup

Review of existing benchmarks in Multimodal Reasoning

Benchmarks:

MMMU (Multimodal Multi-discipline Understanding)
MathVista (Visual Mathematical Reasoning)
ScienceQA (Scientific Question Answering)

Metrics:

Not explicitly reported in the paper
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Conceptual comparison between 'Thinking about Images' and 'Thinking with Images'.

Main Takeaways

The field is moving from static perception ('Thinking about') to dynamic manipulation ('Thinking with'), driven by the need for fine-grained spatial and physical reasoning.
A three-stage evolution is observed: initially relying on external tools (Stage 1), then code generation (Stage 2), and finally intrinsic image generation (Stage 3).
Visual simulation allows models to 'outsource' validation to the consistency of the visual world (e.g., checking if a generated plan looks physically impossible).
Major challenges remain in computational efficiency and preventing visual hallucinations from poisoning the reasoning chain.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Familiarity with Large Multimodal Models (LMMs)
Basic knowledge of Agentic AI (tool use)

Key Terms

LMM: Large Multimodal Model, an AI model that processes images together with text (e.g., GPT-4V, LLaVA); unified generative LMMs can also produce images

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

Thinking about Images: The traditional paradigm where an image is encoded once into static features, and all subsequent reasoning is done via text

Thinking with Images: The proposed paradigm where models actively generate, crop, or manipulate images as intermediate steps to support reasoning

SFT: Supervised Fine-Tuning—training a model on labeled examples to follow specific instructions or reasoning patterns

RL: Reinforcement Learning—training models via rewards/penalties to optimize complex behaviors

Visual Chain of Thought: A reasoning sequence where some links in the chain are visual (e.g., a generated diagram) rather than textual

Intrinsic Imagination: The ability of a model to internally generate new visual representations (like mental simulations) without relying on external tools or code