Simple o3: Towards Interleaved Vision-Language Reasoning

📝 Paper Summary

Multimodal Large Language Models (MLLMs) Chain-of-Thought (CoT) Reasoning Tool-augmented Reasoning

Simple o3 emulates the 'thinking with images' paradigm by integrating dynamic visual tools into an interleaved vision-language reasoning chain, trained on a synthesized dataset of 146K diverse samples.

Core Problem

Existing MLLMs lack extended Chain-of-Thought capabilities in multimodal scenarios, specifically the ability to iteratively manipulate and revisit visual information during reasoning.

Why it matters:

Current approaches separate perception from reasoning, limiting performance on complex tasks requiring hierarchical decomposition.
Eliciting tool-use often relies on resource-intensive Reinforcement Learning or human annotation, lacking scalable data synthesis pipelines.
The impact of specific visual tools and input resolution on interleaved reasoning remains underexplored.

Concrete Example: When answering a question about a small detail in a high-resolution image, standard models might miss the entity due to fixed resolution. Simple o3 uses 'focus_area' to crop the relevant region, creating a new visual token that grounds the reasoning, eventually leading to the correct answer.

Key Novelty

End-to-end framework for 'thinking with images' via tool interaction

Reproduces OpenAI's o3 paradigm using a scalable 'observe-reason-act' data synthesis pipeline that generates interleaved image-text reasoning chains.
Integrates dynamic visual tools (focus_area, zoom_in, reuse) directly into the reasoning process, allowing the model to modify its visual input iteratively.
Employs a modality-aware masking strategy during training to optimize text generation while maintaining cross-modal context from intermediate visual states.

Architecture

The inference workflow of Simple o3 showing the iterative loop of reasoning, tool execution, and observation update.

Evaluation Highlights

+31.2 point improvement on the MME reasoning subset compared to the base Qwen2.5-VL-7B model, surpassing GPT-4o by 27 points.
Achieves 7.4% and 12.9% improvements on fine-grained perception benchmarks HR-Bench 4K and VStarBench respectively.
+16.6% improvement on MMVet spatial reasoning subtasks, demonstrating enhanced understanding of object relationships.

Breakthrough Assessment

8/10

Significantly advances open-source multimodal reasoning by successfully replicating the 'thinking with images' paradigm with a scalable data pipeline, showing massive gains on reasoning benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Interleaved vision-language reasoning where a model generates a chain of reasoning steps, tool calls, and updated visual observations to answer a query.

Inputs: Input query Q and input image I

Outputs: Final response S, derived through an iterative reasoning chain involving text and transformed images

Pipeline Flow

Input Processing: Query + Image
Reasoning Loop: Model generates reasoning content -> Tool Call -> Visual Processing -> New Observation
Final Generation: Answer extraction

System Modules

MLLM Core

Generates reasoning steps, plans visual operations, and formulates final answers

Model or implementation: Qwen2.5-VL-7B

Tool Executor

Parses tool commands and executes image transformations

Model or implementation: Deterministic Image Processing Functions

Visual Encoder

Encodes the original and transformed images into visual tokens

Model or implementation: Qwen2.5-VL Vision Encoder (Frozen)

Novel Architectural Elements

Interleaved 'user-assistant' training format where tool outputs are treated as observations embedded directly into the conversation history
Integration of specific visual tools (focus_area, zoom_in, reuse) enabling the model to dynamically alter its own visual input stream

Modeling

Base Model: Qwen2.5-VL-7B

Training Method: Supervised Fine-Tuning (SFT) with modality-aware masking

Objective Functions:

Purpose: Optimize textual generation while conditioning on visual inputs.

Formally: Masked cross-entropy loss L(θ) = - sum(m_t * log P(z_t | H_{t-1}; θ)), where m_t is 1 for text tokens and 0 for visual tokens.

Adaptation: Full parameter fine-tuning (vision encoder and projector frozen)

Trainable Parameters: LLM backbone parameters

Training Data:

TWI-Tools-146K dataset
Includes 100K high-quality synthesized samples from MATHV360K, LLaVA-CoT-100K, and ArxivQA
Generated via 'observe-reason-act' cycle using Gemini-2.5-Flash and verified by Qwen3-turbo

Key Hyperparameters:

learning_rate: 1.0e-5
batch_size: 8
epochs: 1
+ 3 more
scheduler: cosine (10% warmup)
max_sequence_length: 8192
image_resolution_range: Min 4x28x28, Max 1024x28x28

Compute: Not reported in the paper

Comparison to Prior Work

vs. GPT-4o: Simple o3 achieves higher performance on MME reasoning subset despite smaller size (7B vs proprietary).
vs. DeepEyes/Chain-of-Focus: Simple o3 uses a broader toolset (including 'reuse' and 'zoom_in') and relies on SFT with synthesized data rather than RL.
vs. LLaVA-OneVision: Simple o3 integrates dynamic tool interaction allowing iterative visual refinement, whereas LLaVA-OneVision typically processes static inputs.

Limitations

The model currently supports a limited set of three visual tools (focus_area, zoom_in, reuse).
Performance on hallucination benchmarks like POPE is not best-in-class, though improved.
The 'zoom_in' tool provides only marginal benefits compared to 'reuse' and 'focus_area'.
Reliance on a specific proprietary model (Gemini) for the data synthesis pipeline.

Reproducibility

Code: https://github.com/twi-tools/simple_o3

The TWI-Tools-146K dataset and code are publicly available. The paper specifies the base model (Qwen2.5-VL-7B) and the generator models (Gemini-2.5-Flash) used for data synthesis.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation across diverse multimodal benchmarks covering reasoning, perception, VQA, and hallucination.

Benchmarks:

MME (Reasoning subset) (Multimodal Reasoning)
HR-Bench 4K (Fine-grained Perception)
VStarBench (Fine-grained Perception)
ScienceQA (General VQA)
HallusionBench (Hallucination Evaluation)

Metrics:

Accuracy
Score (Standard benchmark metrics)
CIDEr (for COCO Caption)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Simple o3 demonstrates superior performance on reasoning and fine-grained perception benchmarks compared to the base model and proprietary baselines.
MME (Reasoning)	Score	138.6	188.6	+50.0
HR-Bench 4K	Accuracy	69.1	76.5	+7.4
VStarBench	Accuracy	59.2	72.1	+12.9
ScienceQA	Accuracy	88.7	90.0	+1.3
MMVet	Score	65.3	66.8	+1.5
MME	Score	157.4	188.6	+31.2
VStarBench	Accuracy	65.2	72.1	+6.9

Main Takeaways

The 'reuse' tool, which re-inputs the original image, significantly boosts reasoning by introducing additional visual tokens, validating 'thinking with images'.
The 'focus_area' tool (cropping) is essential for fine-grained perception tasks where target objects are small relative to the image.
Including diverse training data (specifically MathV360K) greatly enhances logical reasoning capabilities, even if the specific math tasks are excluded.
Simple o3 outperforms RL-based 'thinking with images' approaches (DeepEyes, Chain-of-Focus) on perception benchmarks without complex RL training.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) reasoning
Supervised Fine-Tuning (SFT)
Visual grounding and coordinate systems

Key Terms

interleaved vision-language reasoning: A reasoning process where text generation and visual perception alternate, allowing the model to process new visual information (like crops) dynamically.

focus_area: A visual tool that returns an image cropped to a specific bounding box defined by the model.

zoom_in: A visual tool that magnifies the entire image area via interpolation.

reuse: A visual tool that directly outputs the original image, introducing additional visual tokens to reinforce perception.

modality-aware masking: A training technique where gradients are computed only for textual outputs, while visual tokens serve as context (masked out in loss computation).

thinking with images: An iterative paradigm where visual perception and cognitive processing co-evolve, transforming images during the reasoning process.

CoT: Chain-of-Thought—a technique enabling LLMs to solve complex problems by generating intermediate reasoning steps.