DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

📝 Paper Summary

Web agents · Agent evaluation

DeepSearchQA introduces a 900-prompt benchmark evaluating autonomous web agents on their ability to perform deep, multi-step research to generate exhaustive, verifiable answer sets rather than single data points.

Core Problem

Current agent benchmarks rely on single-answer retrieval tasks (like 'What is the capital of France?'), which fail to evaluate higher-order capabilities like systematic collation, de-duplication, and stopping criteria required for deep research.

Why it matters:

Real-world user needs often require comprehensive lists (e.g., 'All companies with P/E < 20'), not just single facts.
Existing precision-focused benchmarks reward narrow, targeted search trajectories rather than the exhaustive exploration needed to close the 'Comprehensiveness Gap'.
Current evaluation methods mask critical failure modes like premature stopping (under-retrieval) and hedging (over-retrieval/hallucination) in autonomous agents.

Concrete Example: For the query 'List all companies in the semiconductor sector with a P/E ratio under 20...', a standard agent might find one example and stop, whereas a deep research agent must visit hundreds of sources, de-duplicate entities, and decide when the list is complete.

Key Novelty

DeepSearchQA Benchmark

Shifts evaluation from precision-based single-answer retrieval to exhaustive answer set generation, requiring agents to balance exploration (casting a wide net) and exploitation (verifying candidates).
Categorizes tasks into Structured Retrieval, Context Management, and Logical Reasoning to diagnose specific cognitive bottlenecks.
Uses a strict outcome-based evaluation metric (F1 Score on answer sets) to penalize both under-retrieval (missing items) and over-retrieval (hallucinations/drift).
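As a rough illustration of how a set-level F1 penalizes both failure modes, here is a minimal sketch. It assumes exact string matching between submitted and ground-truth items, which is a simplification: the benchmark itself relies on an LLM judge for semantic equivalence. The function name `set_f1` and the toy answer sets are hypothetical.

```python
def set_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for two answer sets.

    Assumption: items match only if their strings are identical.
    """
    if not predicted and not gold:
        return 1.0, 1.0, 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy ground truth with four items (hypothetical placeholders).
gold = {"A", "B", "C", "D"}

# Under-retrieval: the agent stops after finding half of the items.
under = set_f1({"A", "B"}, gold)

# Over-retrieval: the agent finds everything but adds two wrong items.
over = set_f1({"A", "B", "C", "D", "X", "Y"}, gold)

print(f"under-retrieval F1 = {under[2]:.3f}")
print(f"over-retrieval  F1 = {over[2]:.3f}")
```

In this toy case both submissions fall short of a perfect score: the first loses recall, the second loses precision.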

Architecture

[Figure: Distribution of domains in the DeepSearchQA dataset.]

Evaluation Highlights

Gemini Deep Research Agent achieves state-of-the-art performance with 66.09% Fully Correct success rate and 81.90% F1 score.
Reasoning models without agentic loops struggle: Gemini 2.5 Flash achieves only 42.99% F1, roughly half that of the agentic leader, with a 45.27% Fully Incorrect rate.
Allocating more test-time compute via sampling increases the Fully Correct rate from 67.18% with one sample (n=1) to 85.71% with eight samples (n=8).

Breakthrough Assessment

8/10

Significant shift in evaluation paradigm addressing the 'Comprehensiveness Gap'. The focus on set-based answers effectively isolates complex agentic behaviors like stopping criteria and de-duplication.

⚙️ Technical Details

Problem Definition

Setting: Autonomous information-seeking agents interacting with the open web to answer complex queries.

Inputs: Natural language query q requiring a set of answers (e.g., list of items) or a deep-research single answer.

Outputs: A set of distinct answers S_i submitted by the agent.

Pipeline Flow

Benchmark Evaluation Pipeline: Agent Input → Agent Search/Reasoning → Answer Set Submission → LLM-as-a-Judge Verification
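The four stages can be sketched as a small harness. This is only an illustrative outline under stated assumptions: `agent` and `judge` are hypothetical callables standing in for the system under test and the LLM judge, and the one-to-one matching rule is a design choice of this sketch, not something the summary specifies.

```python
from typing import Callable, Iterable

Agent = Callable[[str], Iterable[str]]  # query -> submitted answers (hypothetical)
Judge = Callable[[str, str], bool]      # (submitted, gold) -> equivalent? (hypothetical)


def evaluate_prompt(query: str, gold: list[str], agent: Agent, judge: Judge) -> dict:
    """Run one prompt: agent search -> answer set -> judge verification."""
    # Drop exact duplicate strings while keeping submission order.
    submitted = list(dict.fromkeys(agent(query)))
    matched_gold: set[str] = set()
    extraneous: list[str] = []
    for answer in submitted:
        # Assumption: each gold item can be matched by at most one submission.
        hit = next(
            (g for g in gold if g not in matched_gold and judge(answer, g)),
            None,
        )
        if hit is None:
            extraneous.append(answer)
        else:
            matched_gold.add(hit)
    return {
        "matched": sorted(matched_gold),
        "missing": sorted(set(gold) - matched_gold),
        "extraneous": extraneous,
    }


# Stand-ins for illustration only: a canned agent and a case-insensitive judge.
def toy_agent(query: str) -> list[str]:
    return ["alpha", "Beta", "Omega"]


def toy_judge(submitted: str, gold_item: str) -> bool:
    return submitted.strip().casefold() == gold_item.strip().casefold()


result = evaluate_prompt("toy query", ["Alpha", "Beta", "Gamma"], toy_agent, toy_judge)
print(result)
```

The returned counts of matched, missing, and extraneous items are enough to derive precision, recall, and the outcome categories downstream.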

System Modules

Agent Under Test

Execute search plan on the open web and generate answer set

Model or implementation: Various (Gemini Deep Research Agent, GPT-5 Pro, etc.)

LLM-as-a-Judge

Determine semantic equivalence between extracted answers and ground truth items

Model or implementation: Gemini 2.5 Flash

Modeling

Base Model: Multiple models are evaluated: Gemini Deep Research Agent, GPT-5 Pro High Reasoning, o3 Deep Research, o4 Mini Deep Research, and Gemini 2.5 Flash

Comparison to Prior Work

vs. SimpleQA: DeepSearchQA requires exhaustive set generation (recall-oriented) rather than just precision-oriented single-answer retrieval.
vs. GAIA/BrowseComp: DeepSearchQA focuses explicitly on the 'Comprehensiveness Gap'—systematic collation and de-duplication of lists.
vs. Fact-Fetch-Reason [not cited in paper]: DeepSearchQA makes the stopping decision under epistemic uncertainty a core capability under evaluation.

Limitations

Evaluation relies on an LLM-as-a-judge (Gemini 2.5 Flash), which may introduce verification errors.
Ground truth is based on time-anchored or static data, potentially limiting relevance to real-time dynamic events.
High costs associated with running deep research agents for evaluation.
The 'Last Mile Problem' persists where agents have high F1 but fail strict 'Fully Correct' criteria due to minor hedging or over-retrieval.

Reproducibility

Code: https://www.kaggle.com/benchmarks/google/dsqa/leaderboard

The DeepSearchQA dataset and leaderboard are available on Kaggle (https://www.kaggle.com/benchmarks/google/dsqa/leaderboard). The code for the specific proprietary agents evaluated (Gemini Deep Research, GPT-5 Pro) is not released. The LLM-as-a-Judge prompt is provided in Appendix A.

📊 Experiments & Results

Evaluation Setup

900-prompt benchmark across 17 fields (Politics, Finance, Science, etc.). Agents access the open web.

Benchmarks:

DeepSearchQA (Multi-step information-seeking / Exhaustive list generation) [New]

Metrics:

F1-Score
Fully Correct rate (Exact Set Match)
Fully Incorrect rate
Correct with Extraneous Answers rate
Statistical methodology: Not explicitly reported in the paper
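The outcome categories listed above can be read as a partition over per-prompt match counts. The sketch below is one plausible reading, not the paper's definition: it assumes "Correct with Extraneous Answers" means full recall with at least one unmatched submission, and it introduces a hypothetical "Partially Correct" label for everything that fits no named category.

```python
def classify_outcome(n_matched: int, n_submitted: int, n_gold: int) -> str:
    """Map per-prompt counts to an outcome category.

    n_matched:   ground-truth items the judge found in the submission
    n_submitted: distinct answers the agent submitted
    n_gold:      size of the ground-truth set
    """
    if n_gold <= 0:
        raise ValueError("ground-truth set must be non-empty")
    if not 0 <= n_matched <= min(n_submitted, n_gold):
        raise ValueError("n_matched cannot exceed either set size")
    if n_matched == 0:
        return "Fully Incorrect"
    if n_matched == n_gold and n_submitted == n_gold:
        return "Fully Correct"
    if n_matched == n_gold:
        # Assumed reading: every gold item found, plus extra answers.
        return "Correct with Extraneous Answers"
    # Hypothetical catch-all label, not a category named in the summary.
    return "Partially Correct"


labels = [
    classify_outcome(4, 4, 4),
    classify_outcome(0, 3, 4),
    classify_outcome(4, 6, 4),
    classify_outcome(2, 2, 4),
]
print(labels)
```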

Key Results

Main leaderboard results comparing deep research agents and reasoning models (values in %, Δ in percentage points).
Benchmark	Metric	Baseline	This Paper	Δ
DeepSearchQA	Fully Correct	65.18	66.09	+0.91
DeepSearchQA	Fully Incorrect	14.13	9.95	-4.18
DeepSearchQA	F1 Score	79.82	81.90	+2.08
DeepSearchQA	F1 Score	42.99	81.90	+38.91
DeepSearchQA	Fully Correct	44.24	66.09	+21.85
Test-time compute scaling experiments showing improvements with more samples (values in %, Δ in percentage points).
Benchmark	Metric	Baseline (n=1)	This Paper (n=8)	Δ
DeepSearchQA	Fully Correct	67.18	85.71	+18.53
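The Δ column is the "This Paper" value minus the baseline, in percentage points. A short check that recomputes it from the table values follows; the tuple layout is just a convenience for this sketch. Note that a negative Δ on Fully Incorrect is an improvement, since lower is better for that metric.

```python
# (metric, baseline %, this paper %) copied from the results table.
rows = [
    ("Fully Correct", 65.18, 66.09),
    ("Fully Incorrect", 14.13, 9.95),
    ("F1 Score", 79.82, 81.90),
    ("F1 Score", 42.99, 81.90),
    ("Fully Correct", 44.24, 66.09),
    ("Fully Correct", 67.18, 85.71),  # test-time scaling: n=1 vs. n=8
]

# Difference in percentage points, rounded to the table's two decimals.
deltas = [round(this_paper - baseline, 2) for _, baseline, this_paper in rows]

for (metric, baseline, this_paper), delta in zip(rows, deltas):
    print(f"{metric:<16} {baseline:>6.2f} -> {this_paper:>6.2f}  ({delta:+.2f})")
```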

Main Takeaways

Deep Research Agents significantly outperform standalone reasoning models, confirming the necessity of iterative agentic loops for exhaustive research.
A 'Last Mile Problem' exists: a ~15-point gap between F1 scores and strict 'Fully Correct' rates (e.g., 81.90% F1 vs. 66.09% Fully Correct for Gemini Deep Research Agent) indicates agents struggle to filter noise and stop at the exact right time.
Smaller models (Gemini 2.5 Flash) fail catastrophically (45.27% Fully Incorrect), suggesting a hard reasoning threshold is required for deep research tasks.
Two distinct failure modes emerge: under-retrieval (premature stopping) and hedging (including low-confidence answers to boost recall).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) agents
Information Retrieval metrics (Precision, Recall, F1)
Web browsing agent architectures

Key Terms

Systematic Collation: The ability to visit disparate sources and aggregate fragmented information into a single master list.

Entity Resolution: Identifying when two retrieved entities are identical despite having different names or surface forms (de-duplication).
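A minimal sketch of surface-form de-duplication is shown below. It is a simple normalization heuristic chosen for illustration, not the method used by any evaluated agent: names are accent-stripped, case-folded, and stripped of a small, hypothetical list of corporate suffixes before comparison. Real entity resolution (aliases, tickers, renamed entities) needs far more than this.

```python
import re
import unicodedata

# Hypothetical suffix list for illustration; real data needs a richer one.
_SUFFIXES = {"inc", "corp", "corporation", "ltd", "co", "llc", "plc"}


def normalize_entity(name: str) -> str:
    """Reduce a name to a comparison key (accent-free, lowercase, no suffix)."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.casefold())
    # Strip trailing corporate suffixes, but always keep at least one token.
    while len(tokens) > 1 and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def deduplicate(names: list[str]) -> list[str]:
    """Keep the first surface form seen for each normalized key."""
    seen: dict[str, str] = {}
    for name in names:
        seen.setdefault(normalize_entity(name), name)
    return list(seen.values())


unique = deduplicate(["Acme Corp.", "ACME Corporation", "acme", "Beta Ltd", "Béta"])
print(unique)
```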

Stopping Criteria: The decision-making process where an agent determines it has found all possible answers and ends the search.

Hedging: A failure mode where an agent pads its submission with low-confidence candidates to boost recall (e.g., answering 'Brazil and Italy' when only one is correct) instead of committing to the answers it can verify.

F1 Score: The harmonic mean of Precision and Recall, used here to measure the quality of the retrieved answer set against the ground truth.
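Written out for answer sets, with S_i the submitted set and G_i the ground-truth set for prompt i (the symbol G_i is introduced here for illustration), the standard definitions are:

```latex
P_i = \frac{|S_i \cap G_i|}{|S_i|}, \qquad
R_i = \frac{|S_i \cap G_i|}{|G_i|}, \qquad
F_{1,i} = \frac{2\,P_i\,R_i}{P_i + R_i} = \frac{2\,|S_i \cap G_i|}{|S_i| + |G_i|}
```

For example, a submission of 6 items that contains all 4 ground-truth items gives P_i = 4/6, R_i = 1, and F_{1,i} = 8/10 = 0.8.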

Deep Research Agent: An autonomous agent designed to execute complex search plans, manage memory, and perform multi-step reasoning over long horizons.

Comprehensiveness Gap: The disparity between an agent's ability to retrieve a single fact versus generating an exhaustive list of all relevant items.

Fully Correct: A metric category where the agent's submitted set is semantically identical to the ground truth (Recall=1.0, Precision=1.0).

Fully Incorrect: A metric category where the intersection between the submitted answer set and the ground truth is empty.