
DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, Chenglin Zhu, Wei Lu, Guohai Xu, Xing Yu
Xiaohongshu Inc.
arXiv.org (2025)
MM Agent RL Reasoning Benchmark

📝 Paper Summary

Agentic Multimodal Models · Tool-augmented Reasoning · Reinforcement Learning for LLMs
DeepEyesV2 enables multimodal models to actively interleave code execution and web search via a two-stage training pipeline combining cold-start fine-tuning with reinforcement learning.
Core Problem
Existing Multimodal Large Language Models (MLLMs) are passive: they cannot actively invoke tools for fine-grained perception or up-to-date information, and applying RL directly fails to induce robust tool use.
Why it matters:
  • Current models cannot perform precise operations (e.g., measuring, cropping) or access real-time data, leading to hallucinations and calculation errors
  • Direct reinforcement learning without initialization leads to 'reward hacking' (e.g., generating useless code comments) rather than functional tool use
  • Existing benchmarks evaluate perception, reasoning, or search in isolation, failing to assess the coordinated integration required for real-world tasks
Concrete Example: When asked to identify a flower species in an image, a standard model guesses from general features. An agentic model should instead crop the flower to inspect its details, search with the cropped image, and verify the species. Without proper training, however, the model often fails to invoke these tools or generates buggy code.
Key Novelty
Two-Stage Agentic Training Pipeline (Cold-Start SFT + RL)
  • Implements a 'cold-start' stage using a curated dataset of difficult, tool-necessary examples to establish basic execution patterns, preventing the RL reward hacking observed in preliminary experiments
  • Follows with an outcome-driven Reinforcement Learning stage that optimizes tool invocation strategies using only accuracy and format rewards, without complex intermediate reward engineering
  • Unifies 'Operation tools' (code execution/cropping) and 'Information retrieval tools' (web search) within a single dynamic reasoning loop
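To make the outcome-driven reward concrete, here is a minimal sketch of a reward that uses only an accuracy term and a format term. The `<answer>` tag convention, the exact-match comparison, and the `FORMAT_WEIGHT` value are illustrative assumptions, not details taken from the paper.

```python
import re

# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_WEIGHT = 0.5  # assumed weight for the format term


def format_reward(response: str) -> float:
    """Return 1.0 if the response has exactly one answer block, else 0.0."""
    return 1.0 if len(ANSWER_RE.findall(response)) == 1 else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """Return 1.0 if the extracted answer matches the reference (case-insensitive)."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0


def outcome_reward(response: str, gold: str) -> float:
    """Combine the two terms; no reward is given for intermediate tool calls."""
    return accuracy_reward(response, gold) + FORMAT_WEIGHT * format_reward(response)
```

Because nothing in this sketch scores intermediate steps, a rollout that emits tool calls but reaches a wrong answer earns only the format term. That is one way to read why the summary stresses a cold-start stage: the policy needs working tool-use habits before such a sparse signal can refine them.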
Architecture
Figure 3: The inference pipeline, in which DeepEyesV2 interleaves reasoning with tool invocation (code execution and web search).
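The interleaving described above can be sketched as a simple loop in which a policy alternates between tool calls and a final answer. Everything here is a hypothetical stand-in: `run_agent`, `scripted_policy`, the tool names, and the stub tool outputs are placeholders for illustration, not the DeepEyesV2 implementation.

```python
from typing import Callable, Dict, List, Tuple

# An action is (kind, payload); kind is "code", "search", or "answer".
Action = Tuple[str, str]


def run_agent(policy: Callable[[List[str]], Action],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_turns: int = 5) -> str:
    """Alternate between policy steps and tool observations until an answer."""
    context = [f"question: {question}"]
    for _ in range(max_turns):
        kind, payload = policy(context)
        if kind == "answer":
            return payload
        tool = tools.get(kind)
        observation = tool(payload) if tool else f"unknown tool: {kind}"
        # Feed the tool call and its result back into the context.
        context.append(f"{kind}: {payload}")
        context.append(f"observation: {observation}")
    return ""  # turn budget exhausted without an answer


def scripted_policy(context: List[str]) -> Action:
    """Toy policy: crop first, then search, then answer with the last observation."""
    seen = sum(1 for item in context if item.startswith("observation:"))
    if seen == 0:
        return ("code", "crop(flower_region)")
    if seen == 1:
        return ("search", "cropped flower image")
    return ("answer", context[-1].split(": ", 1)[1])


# Stub tools with made-up outputs; a real system would execute code and query the web.
TOOLS = {
    "code": lambda source: "cropped region ready",
    "search": lambda query: "species-X",
}

result = run_agent(scripted_policy, TOOLS, "Which flower species is this?")
```

In a trained model the policy is the MLLM itself, so the choice of which tool to call, and whether to call one at all, is learned rather than scripted.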
Evaluation Highlights
  • Achieves 28.9% average score on RealX-Bench, outperforming Qwen2.5-VL-7B (12.3%) and the previous DeepEyes model (12.8%)
  • Surpasses MMSearch-R1 on the MMSearch benchmark (63.7% vs 53.8%) by effectively combining search with perception
  • Improves mathematical reasoning on MathVerse by 7.1 points (reaching 52.7% accuracy) through active code execution
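As a quick sanity check on the figures above, the margins can be recomputed directly from the reported scores. The only derived quantity is the MathVerse starting score, which is implied by the stated gain and final accuracy rather than reported in this summary.

```python
# Scores as reported in the highlights above (percentages).
realx = {"DeepEyesV2": 28.9, "Qwen2.5-VL-7B": 12.3, "DeepEyes": 12.8}
mmsearch = {"DeepEyesV2": 63.7, "MMSearch-R1": 53.8}
mathverse_final, mathverse_gain = 52.7, 7.1

# Margins in percentage points, rounded to one decimal place.
realx_vs_qwen = round(realx["DeepEyesV2"] - realx["Qwen2.5-VL-7B"], 1)
realx_vs_deepeyes = round(realx["DeepEyesV2"] - realx["DeepEyes"], 1)
mmsearch_margin = round(mmsearch["DeepEyesV2"] - mmsearch["MMSearch-R1"], 1)

# Implied starting score on MathVerse (derived, not reported above).
mathverse_start = round(mathverse_final - mathverse_gain, 1)
```

All of these differences are in absolute percentage points, not relative improvements.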
Breakthrough Assessment
8/10
Strong methodological contribution in stabilizing tool-use training via cold-start SFT and demonstrating the synergy of search and code execution. The proposal of RealX-Bench fills a critical gap in evaluating integrated multimodal capabilities.