Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

📝 Paper Summary

Agentic Search, Autonomous Web Agents

Mind2Web 2 is a benchmark for long-horizon agentic search that uses a novel tree-structured Agent-as-a-Judge framework to automatically evaluate complex, time-varying answers.

Core Problem

Existing benchmarks for web agents either focus on short-horizon tasks (single website) or rely on static, predefined answers, making them unsuitable for evaluating 'Deep Research' agents that produce complex, time-varying reports spanning many websites.

Why it matters:

Current evaluation methods cannot handle tasks where the correct answer changes over time (e.g., product prices, availability), hindering progress in realistic agent deployment
Reliable automated evaluation is critical for trust; users need to know if an agent's long report is grounded in sources or hallucinated, without manually re-doing the search
Short-horizon benchmarks fail to test an agent's ability to maintain focus and context over hours of browsing and hundreds of actions

Concrete Example: A task asks for five IKEA furniture items under $600 total. The answer changes daily with stock and price. A traditional benchmark with a static answer key would reject valid new items, and a human evaluator is too slow. Mind2Web 2 automates this by checking the agent's answer against live pages and the task's specific constraints.
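To make the contrast concrete, the sketch below compares a static answer-key check with a constraint check against current prices. It is illustrative only: the function names, item names, and prices are hypothetical, and the actual benchmark verifies claims against live web pages rather than a dictionary.

```python
from typing import Dict, List


def static_key_check(answer: List[str], answer_key: List[str]) -> bool:
    """Traditional check: the answer must equal a fixed, pre-recorded item list."""
    return sorted(answer) == sorted(answer_key)


def constraint_check(answer: List[str], live_prices: Dict[str, float],
                     n_items: int = 5, budget: float = 600.0) -> bool:
    """Constraint check: any n_items distinct, currently listed items under budget pass."""
    if len(set(answer)) != n_items:
        return False
    if any(item not in live_prices for item in answer):
        return False  # item is no longer listed, so it cannot be verified
    return sum(live_prices[item] for item in answer) < budget


# Hypothetical data: the key was recorded earlier, the prices are from today.
answer_key = ["chair-a", "desk-a", "lamp-a", "shelf-a", "rug-a"]
live_prices = {"chair-b": 89.0, "desk-b": 199.0, "lamp-b": 25.0,
               "shelf-b": 129.0, "rug-b": 79.0}
agent_answer = ["chair-b", "desk-b", "lamp-b", "shelf-b", "rug-b"]

print(static_key_check(agent_answer, answer_key))   # False
print(constraint_check(agent_answer, live_prices))  # True
```

The static key rejects an answer that satisfies every stated constraint, which is the failure mode the benchmark is designed to avoid.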

Key Novelty

Agent-as-a-Judge with Tree-Structured Rubrics

Decomposes complex evaluation into a hierarchical tree of criteria (rubric), where leaf nodes are binary checks (e.g., 'Is price < $200?') and internal nodes aggregate scores
Leverages generation-verification asymmetry: while finding the answer is hard, verifying specific criteria (correctness and attribution) is easier for a specialized judge agent
Uses a 'gate-then-average' logic where critical nodes (essential constraints) can zero-out branches, while non-critical nodes allow for partial credit

Architecture

The structure of the Rubric Tree used for evaluation

Evaluation Highlights

OpenAI Deep Research achieves 0.54 partial completion score, reaching ~70% of human performance (0.79) while taking less than half the time (8.40 min vs 18.40 min)
Current agents struggle with explicitly time-varying tasks; most systems perform worse on this subset, while humans and OpenAI Operator (which browse live) sustain performance
Automated Judge Agent achieves 99.03% verification correctness compared to human evaluation, enabling reliable scalable benchmarking without human-in-the-loop
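The ratios quoted for OpenAI Deep Research follow directly from the reported numbers. As a quick check of the arithmetic (values taken from the highlights above, rounded to two decimals):

```latex
\[
\frac{0.54}{0.79} \approx 0.68
\qquad\text{and}\qquad
\frac{8.40\ \text{min}}{18.40\ \text{min}} \approx 0.46
\]
```

So the agent reaches a little under 70% of the human partial completion score in a little under half the time.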

Breakthrough Assessment

9/10

Addresses the critical bottleneck of evaluating open-ended, time-varying agent tasks. The tree-structured judge is a significant methodological advance over standard LLM-as-a-Judge.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of autonomous agents on open-ended, long-horizon web search tasks requiring information synthesis from multiple sources

Inputs: Natural language task description (e.g., 'Find a hotel in Paris under $200 with free wifi...')

Outputs: Comprehensive text answer with citations pointing to source URLs

Pipeline Flow

Judge Agent receives Agent Answer + Rubric
Tree Traversal (Top-down decomposition)
Leaf Node Verification (Extractor + Verifier)
Score Aggregation (Bottom-up calculation)

System Modules

Rubric Tree (Evaluation Logic)

Defines the hierarchical structure of task criteria, distinguishing between critical and non-critical nodes

Model or implementation: N/A (Data Structure)

Extractor (Verification)

Parses the agent's answer text to extract structured information (item names, prices, URLs) relevant to a specific rubric node

Model or implementation: OpenAI o4-mini

Verifier (Verification)

Examines extracted text and webpage screenshots/content to determine whether a criterion is met (binary 0/1)

Model or implementation: OpenAI o4-mini
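The split between the two roles can be sketched as follows. This is a minimal illustration, not the paper's implementation: in Mind2Web 2 both roles are served by an LLM (o4-mini), whereas here a regular expression and a substring test stand in for them. The answer format, the `Claim` type, and the `pages` dictionary of pre-fetched page text are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Claim:
    item: str
    price: float
    url: str


# Assumed answer format, one claim per line: "<item>: $<price> (<url>)"
CLAIM_LINE = re.compile(
    r"^(?P<item>.+?): \$(?P<price>\d+(?:\.\d+)?) \((?P<url>https?://\S+)\)$"
)


def extract_claims(answer: str) -> List[Claim]:
    """Extractor stand-in: turn free text into structured claims."""
    claims = []
    for line in answer.splitlines():
        match = CLAIM_LINE.match(line.strip())
        if match:
            claims.append(Claim(match["item"], float(match["price"]), match["url"]))
    return claims


def verify_claim(claim: Claim, pages: Dict[str, str], max_price: float) -> int:
    """Verifier stand-in: return 1 only if the claim is correct AND attributed."""
    page_text = pages.get(claim.url, "")
    attributed = claim.item in page_text and f"${claim.price:.2f}" in page_text
    correct = claim.price < max_price
    return int(attributed and correct)


answer = "Desk lamp: $19.99 (https://example.com/lamp)\nSome unrelated sentence."
pages = {"https://example.com/lamp": "Desk lamp, now $19.99, in stock"}

claims = extract_claims(answer)
print([verify_claim(c, pages, max_price=200.0) for c in claims])  # [1]
```

Keeping extraction separate means the verifier only ever sees one small, structured claim at a time, regardless of how long the agent's report is.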

Score Aggregator (Evaluation Logic)

Propagates scores up the tree: a parent scores 0 if any of its critical children fails; otherwise it averages the scores of its non-critical children

Model or implementation: Deterministic Algorithm
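A minimal sketch of gate-then-average aggregation is shown below. The `RubricNode` class and `score` function are hypothetical names, and two details are assumptions not stated in this summary: a critical child counts as failed whenever its score is below 1.0, and a parent whose children are all critical (and all pass) scores 1.0.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    name: str
    critical: bool = False
    passed: Optional[bool] = None  # binary verdict, set on leaf nodes only
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Bottom-up score in [0, 1] using gate-then-average."""
    if not node.children:
        return 1.0 if node.passed else 0.0

    critical = [score(c) for c in node.children if c.critical]
    optional = [score(c) for c in node.children if not c.critical]

    # Gate: any failed critical child zeroes out this branch (assumed: < 1.0 fails).
    if any(s < 1.0 for s in critical):
        return 0.0
    # Average: non-critical children contribute partial credit.
    if not optional:
        return 1.0
    return sum(optional) / len(optional)


root = RubricNode("task", children=[
    RubricNode("total under budget", critical=True, passed=True),
    RubricNode("item 1 supported by its URL", passed=True),
    RubricNode("item 2 supported by its URL", passed=False),
])
print(score(root))  # 0.5

root.children[0].passed = False
print(score(root))  # 0.0
```

The first call gives partial credit for one of two supported items; the second shows the critical constraint gating the whole branch to zero.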

Novel Architectural Elements

Tree-structured rubric combining 'critical' (gating) and 'non-critical' (averaging) nodes to model complex task satisfaction
Separation of Extractor and Verifier roles within the judge agent to handle long-context unstructured answers

Modeling

Base Model: OpenAI o4-mini used for Judge Agent tools (Extractor/Verifier)

Training Method: Prompt Engineering + Tool Use (No Fine-Tuning reported for the Judge)

Compute: Judge agent inference time not explicitly reported, but tasks themselves take ~10-20 mins for agents to solve

Comparison to Prior Work

vs. BrowseComp: Mind2Web 2 evaluates time-varying answers using Agent-as-a-Judge, whereas BrowseComp relies on static answer strings [not cited in paper]
vs. GAIA/AssistantBench: Focuses on significantly longer horizons (dozens/hundreds of actions vs <10) and complex report generation rather than short answers
vs. WebVoyager: Uses hierarchical tree rubrics for granular partial credit, rather than simple success/fail LLM judging

Limitations

Dependency on proprietary models (OpenAI o4-mini) for the judge agent
Judge agent may struggle if information is hidden in collapsed web content (verified in error analysis)
Strict rubric structure requires significant human effort (1000+ hours) to construct initially
Evaluates English tasks only

Reproducibility

Code: https://osu-nlp-group.github.io/Mind2Web-2/

Publicly available: benchmark data (10 dev tasks), evaluation toolkit, and leaderboard. Not released: the private test set (120 tasks), which is withheld to prevent contamination. The code for the judge agent is generated and refined through a dedicated procedure described in the paper.

📊 Experiments & Results

Evaluation Setup

130 realistic, long-horizon tasks across diverse domains (Travel, Shopping, Research). 10 tasks in public dev set, 120 in private test set.

Benchmarks:

Mind2Web 2 (Long-horizon agentic web search) [New]

Metrics:

Partial Completion (Average root node score)
Success Rate (Percentage of tasks with score 1.0)
Pass@3 (Success rate with 3 attempts)
Statistical methodology: Reported mean and standard deviation over 3 runs per task.
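The three metrics can be computed from per-task root scores roughly as follows. This is a sketch under assumptions: the `summarize` function and the input layout (task id mapped to the root scores of its three runs) are hypothetical, and Success Rate is assumed to be averaged over all runs.

```python
from statistics import mean
from typing import Dict, List


def summarize(runs: Dict[str, List[float]]) -> Dict[str, float]:
    """runs maps a task id to the root-node scores of its independent attempts."""
    all_scores = [s for scores in runs.values() for s in scores]
    return {
        # Average root-node score over every run.
        "partial_completion": mean(all_scores),
        # Fraction of runs that reach a perfect root score.
        "success_rate": mean([1.0 if s == 1.0 else 0.0 for s in all_scores]),
        # Fraction of tasks solved perfectly in at least one of the attempts.
        "pass_at_3": mean(
            [1.0 if any(s == 1.0 for s in scores) else 0.0 for scores in runs.values()]
        ),
    }


# Hypothetical scores for two tasks, three runs each.
example = {"task-1": [1.0, 0.5, 0.0], "task-2": [0.5, 0.5, 0.5]}
result = summarize(example)
print(result["partial_completion"])  # 0.5
print(result["pass_at_3"])           # 0.5
```

In this toy data only one of six runs is perfect, so the success rate is 1/6, while Pass@3 is 0.5 because one of the two tasks was solved at least once.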

Key Results

Comparison of different agentic systems showing Deep Research models outperforming standard search-augmented LLMs.
Benchmark	Metric	Baseline	This Paper	Δ
Mind2Web 2	Partial Completion	0.26	0.54	+0.28
Mind2Web 2	Success Rate	0.06	0.28	+0.22
Mind2Web 2	Pass@3	0.36	0.40	+0.04
Human performance comparison highlights the gap remaining for AI agents. In these rows the Baseline column holds the agent score (0.54 partial completion for OpenAI Deep Research) and the This Paper column holds the human score.
Benchmark	Metric	Baseline	This Paper	Δ
Mind2Web 2 (Subset-30)	Partial Completion	0.54	0.79	+0.25
Mind2Web 2 (Subset-30)	Success Rate	0.28	0.54	+0.26

Experiment Figures

Scatter plot of Average Partial Completion vs. Average Task Completion Time for various agents

Bar chart of error types (Incompleteness, Criteria Violation, Invalid Attribution, etc.) across agents and humans

Main Takeaways

Deep Research systems (OpenAI, Gemini, Grok) significantly outperform search-augmented LLMs (ChatGPT, Perplexity) due to better tool use and long-context management
Web browsing capability is crucial: Systems with live browsing (Operator, Deep Research) handle time-varying tasks better than API-only systems
Agents are faster but less accurate than humans: OpenAI Deep Research reaches 54% partial completion in 8.4 mins, while humans reach 79% in 18.4 mins
Incompleteness is the major failure mode: Agents often terminate early or miss partial requirements rather than hallucinating wildly

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based agents and tool use
Familiarity with web browsing environments (DOM, URLs)
Basic knowledge of evaluation metrics (Recall, Precision, Pass@k)

Key Terms

Agentic Search: Systems where agents autonomously browse the web, synthesize information, and return citation-backed answers (e.g., Deep Research)

Agent-as-a-Judge: Using an autonomous AI agent to evaluate the outputs of another AI system, often by verifying claims against external tools or rubrics

Rubric Tree: A hierarchical evaluation structure where a task is broken down into granular criteria (leaf nodes) aggregated to form a final score

Time-varying tasks: Tasks where the correct answer changes over time (e.g., stock prices, weather, availability), requiring real-time verification

Attribution: The practice of citing sources (URLs) that factually support the statements made in the generated answer

Partial Completion: A metric representing the average root node score (0 to 1) across tasks, reflecting partial satisfaction of criteria

Pass@3: A metric indicating whether at least one of three independent attempts for a task resulted in a full success score of 1

Deep Research systems: Agents optimized for long-horizon information gathering, often capable of running for extended periods (30+ mins) to synthesize reports

Generation-verification asymmetry: The concept that generating a complex answer is computationally/cognitively harder than verifying if a specific answer meets defined criteria