MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

📝 Paper Summary

Multimodal Agents, Web Browsing Agents, Benchmark Construction

MM-BrowseComp is a challenging benchmark for multimodal browsing agents that requires retrieving and reasoning over visual content on the web, revealing significant gaps in current state-of-the-art models.

Core Problem

Existing browsing benchmarks like BrowseComp focus primarily on text, overlooking the critical need for agents to retrieve and reason with multimodal content (images, videos) embedded in web pages.

Why it matters:

A vast amount of internet knowledge is locked in images and videos, which text-only search agents cannot access
Current benchmarks are nearing saturation on text tasks, failing to distinguish the reasoning capabilities of advanced agents in realistic multimodal scenarios
Approaches relying on captioning tools for images suffer from significant information loss and hallucination compared to native multimodal reasoning

Concrete Example: A user asks about the color of a specific object in a video clip. A text-only agent might find the video title but cannot watch it to answer. MM-BrowseComp requires the agent to find the video, process the visual frames, and identify the color, which cannot be solved by text search alone.

Key Novelty

Irreducible Multimodal Reasoning Checklist

Introduces a verified checklist for every question that defines the minimal reasoning path required, distinguishing genuine reasoning from lucky guessing
Enforces 'Mandatory Multimodal Dependency' where essential information is embedded primarily in visual modalities (images/videos) and not in text, preventing text-only shortcuts
Constructs questions using an inverted methodology (fact-to-question) with rigorous filtering to ensure questions are unanswerable by GPT-4o/Gemini without tools
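The filtering step in the last bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `Candidate`, `filter_unanswerable`, and the toy model functions are hypothetical names, and normalized exact match stands in for whatever answer matching the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical type: a tool-free model maps a question string to an answer string.
ToolFreeModel = Callable[[str], str]


@dataclass(frozen=True)
class Candidate:
    question: str
    answer: str  # ground-truth fact chosen first; the question is written from it


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude answer comparison."""
    return " ".join(text.lower().split())


def filter_unanswerable(
    candidates: Iterable[Candidate], models: List[ToolFreeModel]
) -> List[Candidate]:
    """Keep only candidates that no tool-free model answers correctly."""
    kept = []
    for cand in candidates:
        solved = any(
            normalize(model(cand.question)) == normalize(cand.answer)
            for model in models
        )
        if not solved:
            kept.append(cand)
    return kept


# Toy stand-ins for tool-free models (assumptions, not real API calls).
def always_blue(question: str) -> str:
    return "Blue"


def always_unknown(question: str) -> str:
    return "unknown"


pool = [
    Candidate("What color is the object in the clip?", "blue"),
    Candidate("How many people appear in the poster?", "7"),
]
survivors = filter_unanswerable(pool, [always_blue, always_unknown])
print([c.question for c in survivors])
```

In this toy run the first candidate is dropped because one stand-in model happens to guess it, which is the kind of shortcut the filter is meant to remove.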

Architecture

Overview of the MM-BrowseComp benchmark concept, showing a user query with an image, the agent browsing web pages containing text and video, and the 'Irreducible Reasoning Checklist' used for evaluation.

Evaluation Highlights

OpenAI o3 with tools achieves only 29.02% accuracy, significantly outperforming other models but still showing ample room for improvement
State-of-the-art open-source and closed-source VLMs (e.g., Gemini-2.5-Pro) fail to surpass 10% accuracy, highlighting extreme difficulty
Native multimodal agents (o3) significantly outperform agents that rely on captioning tools, which suffer from information loss

Breakthrough Assessment

9/10

Addresses a critical gap in agent evaluation (multimodality in web browsing) with a rigorous construction process. The extremely low performance of current SOTA models (even o3 is <30%) establishes it as a definitive challenge for the next generation of agents.

⚙️ Technical Details

Problem Definition

Setting: Open-ended web browsing and question answering requiring multimodal information retrieval

Inputs: Natural language question q, potentially accompanied by an image or video

Outputs: Concise, verifiable answer (e.g., name, number, color) and a completed reasoning checklist

Pipeline Flow

Agent receives multimodal user query
Agent plans search strategy
Agent executes web search / browsing actions
Agent processes retrieved content (text/images/video) to verify checklist items
Agent synthesizes final answer
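The five steps above can be sketched as a small loop. This is an illustrative toy, not the harness used in the paper: `Page`, `run_agent`, the `plan` and `search` callables, and the `visual_facts` dictionary (a stand-in for actually looking at images or video frames) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    url: str
    text: str
    # Stand-in for information only visible in images or video on the page.
    visual_facts: Dict[str, str] = field(default_factory=dict)


def run_agent(
    question: str,                           # step 1: receive the query
    plan: Callable[[str], List[str]],
    search: Callable[[str], Optional[Page]],
    needed_fact: str,
    max_steps: int = 5,
) -> Optional[str]:
    visited: List[str] = []
    evidence: Dict[str, str] = {}
    for query in plan(question)[:max_steps]:  # step 2: plan the search
        page = search(query)                  # step 3: execute a browsing action
        if page is None:
            continue
        visited.append(page.url)
        # Step 4: inspect visual content, not only the page text.
        if needed_fact in page.visual_facts:
            evidence[needed_fact] = page.visual_facts[needed_fact]
            break
    return evidence.get(needed_fact)          # step 5: synthesize the answer


# Toy "web": only the video page carries the fact, and only visually.
WEB = {
    "launch press release": Page("https://example.com/press", "Text-only release"),
    "launch video": Page(
        "https://example.com/video", "Launch recap", {"jacket_color": "red"}
    ),
}


def toy_plan(question: str) -> List[str]:
    return ["launch press release", "launch video"]


def toy_search(query: str) -> Optional[Page]:
    return WEB.get(query)


answer = run_agent(
    "What color is the presenter's jacket?", toy_plan, toy_search, "jacket_color"
)
print(answer)  # red
```

A real agent would replace the dictionary lookup with a VLM call over retrieved frames or images; the sketch only shows where that call sits in the loop.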

System Modules

Browsing Environment

Provides the web interface for the agent to interact with

Model or implementation: Real-world web browser (or simulated equivalent)

Evaluated Agent

The VLM or Agent system being tested

Model or implementation: Various (e.g., OpenAI o3, Gemini-2.5-Pro, Agent-R1)

Novel Architectural Elements

Inclusion of an 'Irreducible Reasoning Checklist' in the dataset structure itself to force step-by-step verification of multimodal dependency
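One way to picture a dataset record that carries its own checklist is the sketch below. The field names, the `modality` labels, and the example item are assumptions for illustration; the released dataset may use a different schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChecklistItem:
    description: str
    modality: str  # assumed labels: "text", "image", or "video"


@dataclass
class BenchmarkItem:
    question: str
    answer: str
    checklist: List[ChecklistItem]
    attached_media: Optional[str] = None  # optional image/video given with the query

    def requires_multimodal(self) -> bool:
        """True if at least one checklist step depends on visual content."""
        return any(step.modality in {"image", "video"} for step in self.checklist)


# Hypothetical item modeled on the video-color example in the summary.
item = BenchmarkItem(
    question="What color is the object shown in the clip?",
    answer="blue",
    checklist=[
        ChecklistItem("Locate the page hosting the clip", "text"),
        ChecklistItem("Find the frame where the object appears", "video"),
        ChecklistItem("Identify the object's color", "video"),
    ],
)
print(item.requires_multimodal())  # True
```

Storing the checklist alongside the question and answer is what lets an evaluator score the reasoning path and not only the final answer.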

Modeling

Base Model: Evaluates multiple external models (OpenAI o3, Gemini 2.5, GPT-4o, etc.)

Comparison to Prior Work

vs. BrowseComp: MM-BrowseComp requires processing images/videos found on the web, whereas BrowseComp is text-only
vs. SimpleQA: Adds the complexity of multi-hop web retrieval and multimodal reasoning
vs. WebArena: Focuses specifically on 'deep search' for information retrieval rather than general functional tasks (booking flights, etc.) [not cited in paper]

Limitations

Evaluating closed-source models (like o3) limits visibility into their internal reasoning processes
Reliance on live web pages means content can change or disappear over time (though authors tried to ensure stability)
The dataset size (224 questions) is relatively small compared to automated benchmarks due to the high cost of expert annotation
Checklist evaluation is hard to scale: manual grading is costly, and automating it requires strong evaluator models

Reproducibility

Code: https://github.com/MMBrowseComp/MM-BrowseComp

The repository is publicly available. The dataset of 224 questions and checklists is released, and evaluation scripts for open-source agents (Agent-R1, OWL, etc.) are provided. Closed-source models (like OpenAI o3) are accessed via API.

📊 Experiments & Results

Evaluation Setup

Agents attempt to answer questions by browsing the live web. Success is measured by final answer correctness and adherence to the reasoning checklist.

Benchmarks:

MM-BrowseComp (Multimodal Web Browsing QA) [New]

Metrics:

Overall Accuracy (OA)
Strict Accuracy (SA)
Average Checklist Score (AVG CS)
Statistical methodology: Not explicitly reported in the paper
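The three metrics can be computed from per-question grading records as sketched below. The sketch assumes OA counts correct final answers, SA additionally requires every checklist item to be completed, and AVG CS averages the per-question fraction of completed items; `GradedQuestion` and the toy records are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GradedQuestion:
    answer_correct: bool
    checklist_done: List[bool]  # one flag per checklist item; assumed non-empty


def overall_accuracy(results: Sequence[GradedQuestion]) -> float:
    """Percentage of questions with a correct final answer."""
    return 100.0 * sum(r.answer_correct for r in results) / len(results)


def strict_accuracy(results: Sequence[GradedQuestion]) -> float:
    """Percentage with a correct answer AND every checklist item completed."""
    strict = sum(r.answer_correct and all(r.checklist_done) for r in results)
    return 100.0 * strict / len(results)


def average_checklist_score(results: Sequence[GradedQuestion]) -> float:
    """Mean per-question percentage of completed checklist items."""
    per_question = [sum(r.checklist_done) / len(r.checklist_done) for r in results]
    return 100.0 * sum(per_question) / len(per_question)


graded = [
    GradedQuestion(True, [True, True, True]),            # correct, full checklist
    GradedQuestion(True, [True, False]),                 # correct, but a step missed
    GradedQuestion(False, [True, False, False, False]),  # wrong, partial progress
    GradedQuestion(False, [False, False]),               # wrong, no progress
]

print(overall_accuracy(graded))         # 50.0
print(strict_accuracy(graded))          # 25.0
print(average_checklist_score(graded))  # 43.75
```

Under these definitions SA can never exceed OA, while AVG CS can exceed both because it credits partial progress on questions that end with a wrong answer.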

Key Results

Performance of Tool-Augmented VLMs shows OpenAI o3 significantly leading but still struggling, while Gemini models perform poorly.
Benchmark	Metric	Baseline	This Paper	Δ
MM-BrowseComp	Overall Accuracy (OA)	8.48	29.02	+20.54
MM-BrowseComp	Strict Accuracy (SA)	5.36	21.88	+16.52
MM-BrowseComp	Average Checklist Score (AVG CS)	25.02	44.52	+19.50
Tool-Free VLMs perform extremely poorly, validating the benchmark's design requiring external information retrieval.
Benchmark	Metric	Baseline	This Paper	Δ
MM-BrowseComp	Overall Accuracy (OA)	0	8.93	+8.93
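As a quick sanity check on the tool-augmented rows, the Δ column can be recomputed and the OA percentages converted into implied question counts. The conversion assumes every one of the 224 questions was scored, which the summary does not state explicitly.

```python
TOTAL_QUESTIONS = 224  # dataset size reported in the summary

# (metric, lower score, higher score) copied from the tool-augmented rows
rows = [
    ("Overall Accuracy (OA)", 8.48, 29.02),
    ("Strict Accuracy (SA)", 5.36, 21.88),
    ("Average Checklist Score (AVG CS)", 25.02, 44.52),
]

# Recompute the delta column, rounded to two decimals like the table.
deltas = {name: round(high - low, 2) for name, low, high in rows}


def implied_count(percent: float) -> int:
    """Number of questions a percentage corresponds to, assuming all 224 were scored."""
    return round(percent * TOTAL_QUESTIONS / 100)


oa_counts = (implied_count(8.48), implied_count(29.02))

for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f}")
print("Implied correct answers (OA):", oa_counts)  # (19, 65)
```

The recomputed deltas match the table, and the implied counts suggest the leading system answers about 65 of 224 questions correctly.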

Experiment Figures

Distribution of the 224 questions across 5 categories (Media, Technology, Society, Geography, Academics) and 22 subtasks.

Main Takeaways

MM-BrowseComp is exceptionally difficult; even the best model (OpenAI o3, 29.02% OA) fails roughly 71% of the time, and most others fail >90% of the time
Native multimodal reasoning (integrated vision-language processing) is superior to using captioning tools, which lose critical visual details
Reflective agent architectures (ReAct, self-correction) are more robust than static orchestration flows
A complete toolset is insufficient without strong underlying reasoning capabilities; similarly, strong reasoning fails without adequate browsing tools

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and their integration with tools
Familiarity with web browsing agent architectures (ReAct, etc.)
Knowledge of evaluation metrics for reasoning tasks (Overall Accuracy vs. Strict Accuracy)

Key Terms

VLM: Vision-Language Model—an AI model trained to understand and process both text and images simultaneously

ReAct: Reasoning + Acting—a prompting paradigm where agents generate reasoning traces before executing actions (like searching the web)

Browsing Agent: An AI system equipped with tools (browser, search engine) to autonomously navigate the internet to answer questions

Irreducible Reasoning Checklist: A sequential list of minimal necessary steps (search queries, page visits, visual verifications) required to derive the correct answer, used to verify the reasoning process

OA: Overall Accuracy—the percentage of questions where the final answer matches the ground truth

SA: Strict Accuracy—the percentage of questions where the model gets the correct answer AND successfully completes all steps in the reasoning checklist

AVG CS: Average Checklist Score—the average percentage of checklist items completed across all questions

Captioning tool: A separate AI module that converts an image into a text description, often used by text-only agents to 'see' images

Hallucination: When an AI model generates plausible-sounding but factually incorrect information