DeepEyesV2: Toward Agentic Multimodal Model

📝 Paper Summary

Agentic Multimodal Models, Tool-augmented Reasoning, Reinforcement Learning for LLMs

DeepEyesV2 enables multimodal models to actively interleave code execution and web search via a two-stage training pipeline combining cold-start fine-tuning with reinforcement learning.

Core Problem

Existing Multimodal Large Language Models (MLLMs) are passive, lacking the ability to actively invoke tools for fine-grained perception or up-to-date information, and direct RL fails to induce robust tool use.

Why it matters:

Current models cannot perform precise operations (e.g., measuring, cropping) or access real-time data, leading to hallucinations and calculation errors
Direct reinforcement learning without initialization leads to 'reward hacking' (e.g., generating useless code comments) rather than functional tool use
Existing benchmarks evaluate perception, reasoning, or search in isolation, failing to assess the coordinated integration required for real-world tasks

Concrete Example: When asked to identify a flower species in an image, a standard model guesses based on general features. An agentic model should crop the flower to observe details, search the cropped image, and verify the species, but without proper training, it often fails to invoke these tools or generates buggy code.

Key Novelty

Two-Stage Agentic Training Pipeline (Cold-Start SFT + RL)

Implements a 'cold-start' stage using a curated dataset of difficult, tool-necessary examples to establish basic execution patterns, preventing the RL reward hacking observed in pioneer experiments
Follows with an outcome-driven Reinforcement Learning stage that optimizes tool invocation strategies using only accuracy and format rewards, without complex intermediate reward engineering
Unifies 'Operation tools' (code execution/cropping) and 'Information retrieval tools' (web search) within a single dynamic reasoning loop

Architecture

The inference pipeline where DeepEyesV2 interleaves reasoning with tool invocation (code and search).

Evaluation Highlights

Achieves 28.9% average score on RealX-Bench, outperforming Qwen2.5-VL-7B (12.3%) and the previous DeepEyes model (12.8%)
Surpasses MMSearch-R1 on the MMSearch benchmark (63.7% vs 53.8%) by effectively combining search with perception
Improves mathematical reasoning on MathVerse by +7.1 points (reaching 52.7% accuracy) through active code execution

Breakthrough Assessment

8/10

Strong methodological contribution in stabilizing tool-use training via cold-start SFT and demonstrating the synergy of search and code execution. The proposal of RealX-Bench fills a critical gap in evaluating integrated multimodal capabilities.

⚙️ Technical Details

Problem Definition

Setting: Multimodal agentic reasoning where the model must dynamically plan and execute external tool calls (code, search) to answer user queries

Inputs: Image and text query

Outputs: Final text answer derived from iterative reasoning and tool observations

Pipeline Flow

Input Processing
Reasoning & Planning
Tool Execution (Code/Search)
Observation Integration
Iterative Refinement
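
The five stages above can be read as a single loop: the model reasons, optionally calls a tool, folds the observation back into its context, and repeats until it answers. The following is a minimal sketch of that control flow only, not the paper's implementation. The names `Step`, `Trajectory`, `run_agent`, `toy_policy`, `toy_tools`, and the `max_turns` budget are all hypothetical, and the policy and tools are stubs standing in for the fine-tuned model, the code executor, and the search backend.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    kind: str     # "code", "search", or "answer" (hypothetical action names)
    content: str  # code to run, query to search, or the final answer text


@dataclass
class Trajectory:
    query: str
    observations: List[str] = field(default_factory=list)


def run_agent(
    query: str,
    policy: Callable[[Trajectory], Step],
    tools: Dict[str, Callable[[str], str]],
    max_turns: int = 5,  # assumed turn budget, not taken from the paper
) -> str:
    """Interleave reasoning and tool calls until the policy emits an answer."""
    traj = Trajectory(query)
    for _ in range(max_turns):
        step = policy(traj)                        # reasoning and planning
        if step.kind == "answer":
            return step.content                    # final answer ends the loop
        observation = tools[step.kind](step.content)  # tool execution
        traj.observations.append(observation)      # observation integration
    return "no answer within turn budget"


# Stub policy: run code first, then search, then answer.
def toy_policy(traj: Trajectory) -> Step:
    n = len(traj.observations)
    if n == 0:
        return Step("code", "crop(flower)")
    if n == 1:
        return Step("search", "cropped flower image")
    return Step("answer", f"answer based on {n} observations")


# Stub tools that echo their input instead of doing real work.
toy_tools = {
    "code": lambda src: f"executed: {src}",
    "search": lambda q: f"results for: {q}",
}

if __name__ == "__main__":
    print(run_agent("What species is this flower?", toy_policy, toy_tools))
```

Treating code and search as entries in the same tool table is what lets one trajectory mix both kinds of call in any order.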

System Modules

Multimodal Reasoner

Generate reasoning plans, decide on tool invocation, and synthesize final answers

Model or implementation: Qwen2.5-VL (fine-tuned)

Code Executor (Tool Execution)

Execute generated Python code in a sandbox

Model or implementation: Python Interpreter
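
As a rough illustration of how generated code might be executed and its result returned to the model as an observation, the sketch below runs a snippet in a separate Python process with a timeout. This is an assumption-laden stand-in: the helper name `run_python` and the `timeout_s` parameter are hypothetical, and a plain subprocess is not a security sandbox, so the paper's actual sandbox is likely to differ.

```python
import subprocess
import sys


def run_python(code: str, timeout_s: float = 5.0) -> str:
    """Run a code string in a child interpreter and return a text observation."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "error: execution timed out"
    if proc.returncode != 0:
        # Surface only the last traceback line so the model sees the failure cause.
        lines = proc.stderr.strip().splitlines()
        return "error: " + (lines[-1] if lines else "unknown failure")
    return proc.stdout.strip()


if __name__ == "__main__":
    print(run_python("print(3 * 7)"))
    print(run_python("1/0"))
```

Returning errors as text rather than raising lets the reasoner see a failed call and attempt a corrected one in the next turn.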

Web Searcher (Tool Execution)

Retrieve external information

Model or implementation: SerpAPI

Novel Architectural Elements

Unified reasoning loop where code execution and multimodal web search are treated as complementary, interleavable actions within the same trajectory

Modeling

Base Model: Qwen2.5-VL

Training Method: Two-stage pipeline: Cold-start SFT followed by Reinforcement Learning

Objective Functions:

Purpose: Optimize accuracy and output structure during RL.

Formally: R = R_{acc} + R_{format}

Purpose: Assess correctness of the final answer.

Formally: R_{acc} = 1 if the final answer is correct, 0 otherwise

Purpose: Penalize format violations.

Formally: R_{format} = a negative penalty applied to invalid outputs
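
A minimal sketch of this outcome-based reward follows. The summary does not give the answer-matching rule or the penalty magnitude, so three things here are assumptions for illustration only: the case-insensitive exact match, the default penalty of -1.0, and a format reward of 0 for valid outputs. The function names are hypothetical.

```python
def accuracy_reward(predicted: str, reference: str) -> float:
    """R_acc: 1 if the final answer is correct, 0 otherwise.

    Correctness is approximated by a case-insensitive exact match (assumption).
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def format_reward(output_is_valid: bool, penalty: float = -1.0) -> float:
    """R_format: a negative penalty for invalid outputs.

    The penalty value and the zero reward for valid outputs are assumptions.
    """
    return 0.0 if output_is_valid else penalty


def total_reward(
    predicted: str, reference: str, output_is_valid: bool, penalty: float = -1.0
) -> float:
    """R = R_acc + R_format."""
    return accuracy_reward(predicted, reference) + format_reward(output_is_valid, penalty)


if __name__ == "__main__":
    print(total_reward("Rose", "rose", output_is_valid=True))
    print(total_reward("tulip", "rose", output_is_valid=False))
```

Neither term rewards the act of calling a tool, so a trajectory gains nothing from tool calls that do not lead to a correct, well-formed answer.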

Training Data:

Data collected for perception, reasoning, and search tasks
Cleaned and filtered: kept only 'hard' examples (base model fails) where tool use is beneficial
Cold-start data: Trajectories synthesized by Gemini 2.5 Pro/GPT-4o/Claude Sonnet 4
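
The filtering rule above (keep examples the base model fails on and where tool use helps) could be sketched as follows. This is one possible reading, not the paper's procedure: "tool use is beneficial" is interpreted here as "a tool-assisted attempt reaches the reference answer", and `keep_hard_tool_examples`, `base_answer`, and `tool_answer` are hypothetical names for callables the reader would supply.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # assumed to hold at least "question" and "answer"


def keep_hard_tool_examples(
    examples: List[Example],
    base_answer: Callable[[Example], str],
    tool_answer: Callable[[Example], str],
) -> List[Example]:
    """Keep examples the base model gets wrong but a tool-assisted attempt gets right."""
    kept = []
    for ex in examples:
        base_fails = base_answer(ex) != ex["answer"]   # "hard" for the base model
        tool_helps = tool_answer(ex) == ex["answer"]   # tools make it solvable
        if base_fails and tool_helps:
            kept.append(ex)
    return kept


if __name__ == "__main__":
    data = [
        {"question": "q1", "answer": "a"},  # base already correct: dropped
        {"question": "q2", "answer": "b"},  # base wrong, tools right: kept
        {"question": "q3", "answer": "c"},  # both wrong: dropped
    ]
    base = lambda ex: {"q1": "a", "q2": "x", "q3": "x"}[ex["question"]]
    tool = lambda ex: {"q1": "a", "q2": "b", "q3": "y"}[ex["question"]]
    print([ex["question"] for ex in keep_hard_tool_examples(data, base, tool)])
```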

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepEyes (v1): DeepEyesV2 integrates search and code execution, not just cropping, enabling knowledge-intensive tasks
vs. MMSearch-R1: DeepEyesV2 combines perception (cropping/measuring via code) with search, whereas MMSearch-R1 lacks fine-grained visual operations
vs. PyVision/Thyme: DeepEyesV2 includes web search capabilities, whereas these models are restricted to image manipulation code (both are mentioned in the paper's related work but not used as direct baselines)

Limitations

Requires a cold-start stage; direct RL fails to learn robust tool use
Evaluation on RealX-Bench shows performance is still well below human capability (28.9% vs 62.1%)

Reproducibility

The paper text does not explicitly state whether code is available. The RealX-Bench dataset is introduced, but no URL for it appears in the text reviewed. The data construction procedure is described.

📊 Experiments & Results

Evaluation Setup

Evaluation across multiple benchmarks focusing on perception, reasoning, and search capabilities

Benchmarks:

RealX-Bench (real-world integrated multimodal reasoning: Perception + Search + Reasoning) [New]
MathVerse (Mathematical reasoning)
MMSearch (Search-intensive QA)

Metrics:

Accuracy
Pass Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

DeepEyesV2 demonstrates large improvements over its base model (Qwen2.5-VL) and prior single-tool baselines on the proposed RealX-Bench, particularly in tasks requiring capability integration.
Benchmark	Metric	Baseline	This Paper	Δ
RealX-Bench	Average Score	12.3 (Qwen2.5-VL-7B)	28.9	+16.6
RealX-Bench	Average Score	12.8 (DeepEyes)	28.9	+16.1
MMSearch	Score	53.8 (MMSearch-R1)	63.7	+9.9
MathVerse	Accuracy	45.6	52.7	+7.1
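
As a quick consistency check, the Δ column can be recomputed from the baseline and DeepEyesV2 scores in the table. The snippet is only arithmetic on the reported numbers. The helper name `delta` is hypothetical, and rounding to one decimal place is used to avoid floating-point noise.

```python
from typing import List, Tuple

# (benchmark, baseline score, DeepEyesV2 score), copied from the table above
ROWS: List[Tuple[str, float, float]] = [
    ("RealX-Bench", 12.3, 28.9),
    ("RealX-Bench", 12.8, 28.9),
    ("MMSearch", 53.8, 63.7),
    ("MathVerse", 45.6, 52.7),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement in points, rounded to one decimal place."""
    return round(ours - baseline, 1)


if __name__ == "__main__":
    for name, baseline, ours in ROWS:
        print(f"{name}: {baseline} -> {ours} ({delta(baseline, ours):+.1f})")
```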

Experiment Figures

Pioneer experiments showing the failure of direct RL training.

Performance comparison on RealX-Bench across different capabilities (Perception, Reasoning, Search, Integration).

Main Takeaways

Task-adaptive tool use: The model autonomously learns to use image operations for perception tasks and numerical computation for reasoning tasks.
Reinforcement Learning is effective for complex tool combinations but requires a cold-start SFT stage to prevent reward hacking.
RealX-Bench reveals a large gap between current models (best ~29%) and human performance (~62%), highlighting the difficulty of integrated multimodal reasoning.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Reinforcement Learning (RL) in LLMs
Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) reasoning

Key Terms

SFT: Supervised Fine-Tuning, i.e. training a model on a labeled dataset to instill specific behaviors (here, basic tool use) before further optimization

Cold-start: An initial training phase used to bootstrap the model's capabilities (e.g., generating valid code) so that subsequent reinforcement learning can explore effective strategies without failing immediately

Reward hacking: A phenomenon where an RL agent maximizes the reward function by finding loopholes (e.g., generating empty code blocks to get a 'tool use' bonus) without solving the actual task

Agentic MLLM: A multimodal model that acts as an autonomous agent, actively planning and invoking external tools to perceive, search, and reason rather than just generating text

Chain-of-Thought (CoT): A prompting or training technique where the model generates intermediate reasoning steps before producing the final answer

Reward engineering: The complex design of reward functions to guide RL agents; this paper minimizes it by using simple outcome-based rewards