
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung
The Hong Kong University of Science and Technology, Microsoft, The Chinese University of Hong Kong
arXiv.org (2025)
MM Reasoning, Agent, RL, Benchmark

📝 Paper Summary

Multimodal Reasoning, Visual Chain-of-Thought
The paper proposes a paradigm shift from 'Thinking about Images' (text-only reasoning on static images) to 'Thinking with Images' (using generated or manipulated visual content as active intermediate reasoning steps).
Core Problem
Current Large Multimodal Models (LMMs) treat an image as static context that is encoded once. This creates a semantic gap: fine-grained details and spatial relationships are lost when the image is flattened into features.
Why it matters:
  • Models falter on tasks requiring iterative visual engagement, such as complex physical reasoning or precise spatial manipulation, because they cannot 'look again' or simulate outcomes
  • Text-centric Chain-of-Thought (CoT) relies on symbolic logic, which is brittle when describing continuous visual or physical properties
  • The initial one-time encoding acts as an information bottleneck: it is often insufficient for answering complex queries
Concrete Example: To determine if a robot can grasp an apple without hitting a glass, a text-only model might hallucinate coordinates. A 'Thinking with Images' model generates a future frame; if the generated image shows a collision, this visual evidence forces the model to revise its plan.
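The simulate-then-revise loop in this example can be sketched in a few lines. This is a toy illustration only, not the paper's method: a character grid stands in for the generated future frame, and `render_frame`, `shows_collision`, and `plan_grasp` are hypothetical names introduced here. The point is that the plan is accepted or rejected based on what the rendered frame shows, not on coordinates reasoned about in text.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (x, y) cell on a small grid


def render_frame(path: List[Point], glass: Point, size: int = 5) -> List[str]:
    """Toy stand-in for image generation: draw the gripper path and the glass."""
    grid = [["." for _ in range(size)] for _ in range(size)]
    for x, y in path:
        grid[y][x] = "*"
    gx, gy = glass
    # "X" marks a cell where the path overlaps the glass; "G" is an untouched glass.
    grid[gy][gx] = "X" if glass in path else "G"
    return ["".join(row) for row in grid]


def shows_collision(frame: List[str]) -> bool:
    """Inspect the rendered frame (the visual evidence), not the plan itself."""
    return any("X" in row for row in frame)


def plan_grasp(candidates: List[List[Point]], glass: Point) -> Optional[List[Point]]:
    """Return the first candidate path whose imagined frame shows no collision."""
    for path in candidates:
        frame = render_frame(path, glass)
        if not shows_collision(frame):
            return path
    return None  # every imagined outcome showed a collision


# Hypothetical scene: the apple is at (4, 2) and the glass sits at (2, 2).
glass = (2, 2)
direct = [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)]
detour = [(0, 2), (1, 2), (1, 1), (2, 1), (3, 1), (3, 2), (4, 2)]
chosen = plan_grasp([direct, detour], glass)
```

Here the direct path is rejected because its rendered frame contains a collision marker, so the loop falls back to the detour. In a real system the renderer would be an image or video generator and the collision check would itself be a perception step.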
Key Novelty
The 'Thinking with Images' Taxonomy
  • Defines a three-stage evolution of visual reasoning: (1) Tool-Driven Exploration (calling external APIs), (2) Programmatic Manipulation (generating code to visualize), and (3) Intrinsic Imagination (internal image generation).
  • Formalizes visual reasoning as a mixed-modal state sequence where intermediate steps can be visual artifacts (images, crops, plots) rather than just text tokens.
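One way to picture the mixed-modal state sequence is as a list whose entries are either text steps or visual artifacts, each visual step tagged with the stage of the taxonomy that produced it. The sketch below is an assumption-laden illustration: the class names (`Stage`, `TextStep`, `VisualStep`, `ReasoningTrace`) and fields are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Union


class Stage(Enum):
    TOOL_DRIVEN = 1    # call an external vision tool or API
    PROGRAMMATIC = 2   # generate code that renders or edits an image
    INTRINSIC = 3      # the model generates the image internally


@dataclass
class TextStep:
    content: str


@dataclass
class VisualStep:
    kind: str              # e.g. "crop", "plot", "generated_frame"
    stage: Stage
    payload: bytes = b""   # placeholder for the image data


Step = Union[TextStep, VisualStep]


@dataclass
class ReasoningTrace:
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> "ReasoningTrace":
        self.steps.append(step)
        return self

    def modalities(self) -> List[str]:
        """Modality of each intermediate step, in order."""
        return ["image" if isinstance(s, VisualStep) else "text" for s in self.steps]


trace = (
    ReasoningTrace()
    .add(TextStep("The label is too small to read; zoom in."))
    .add(VisualStep(kind="crop", stage=Stage.TOOL_DRIVEN))
    .add(TextStep("The cropped label is now legible."))
)
```

A text-only chain of thought would be the special case in which every entry of `trace.modalities()` is `"text"`.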
Breakthrough Assessment
9/10
This is a foundational survey defining a new sub-field. It provides a clear taxonomy and theoretical grounding for the emerging trend of using visual generation as a reasoning step.