
DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

N Gupta, R Chatterjee, L Haas, C Tao, A Wang, C Liu…
Google
arXiv, 1/2026 (2026)
Agent Factuality Reasoning Benchmark

📝 Paper Summary

Web agents Agent evaluation
DeepSearchQA introduces a 900-prompt benchmark evaluating autonomous web agents on their ability to perform deep, multi-step research to generate exhaustive, verifiable answer sets rather than single data points.
Core Problem
Current agent benchmarks rely on single-answer retrieval tasks (like 'What is the capital of France?'), which fail to evaluate the higher-order capabilities that deep research requires: systematic collation, de-duplication, and deciding when to stop searching.
Why it matters:
  • Real-world user needs often require comprehensive lists (e.g., 'All companies with P/E < 20'), not just single facts.
  • Existing precision-focused benchmarks incentivize distinct search trajectories rather than the exhaustive exploration needed to close the 'Comprehensiveness Gap'.
  • Current evaluation methods mask critical failure modes like premature stopping (under-retrieval) and hedging (over-retrieval/hallucination) in autonomous agents.
Concrete Example: For the query 'List all companies in the semiconductor sector with a P/E ratio under 20...', a standard agent might find one example and stop, whereas a deep research agent must visit hundreds of sources, de-duplicate entities, and decide when the list is complete.
Key Novelty
DeepSearchQA Benchmark
  • Shifts evaluation from precision-based single-answer retrieval to exhaustive answer set generation, requiring agents to balance exploration (casting a wide net) and exploitation (verifying candidates).
  • Categorizes tasks into Structured Retrieval, Context Management, and Logical Reasoning to diagnose specific cognitive bottlenecks.
  • Uses a strict outcome-based evaluation metric (F1 Score on answer sets) to penalize both under-retrieval (missing items) and over-retrieval (hallucinations/drift).
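To make the set-based metric concrete, the following is a minimal sketch of F1 over answer sets. It assumes answers can be compared as normalized strings; the `normalize` and `set_f1` helpers are hypothetical names for illustration, not the benchmark's actual grader, which may match answers differently.

```python
def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match.

    Assumption: plain string matching stands in for whatever answer
    matching the real benchmark uses.
    """
    return " ".join(item.lower().split())


def set_f1(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a predicted answer set against a gold set."""
    pred = {normalize(p) for p in predicted}  # building a set also de-duplicates
    ref = {normalize(g) for g in gold}
    true_positives = len(pred & ref)

    # Precision drops with over-retrieval (extra or hallucinated items).
    precision = true_positives / len(pred) if pred else 0.0
    # Recall drops with under-retrieval (missing items).
    recall = true_positives / len(ref) if ref else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["A", "B", "C", "D"]
# Premature stopping: everything returned is right, but half the set is missing.
under = set_f1(["a", "b"], gold)
# Hedging: the full set is covered, but padded with four wrong items.
over = set_f1(["a", "b", "c", "d", "w", "x", "y", "z"], gold)
```

In this toy case both failure modes land on the same F1 (about 0.67), one through low recall and the other through low precision, which is why a single F1 number penalizes stopping early and padding the list alike.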
Architecture
Figure 1: Distribution of domains in the DeepSearchQA dataset.
Evaluation Highlights
  • Gemini Deep Research Agent achieves state-of-the-art performance with 66.09% Fully Correct success rate and 81.90% F1 score.
  • Reasoning models without agentic loops struggle: Gemini 2.5 Flash achieves only 42.99% F1, roughly half the agentic leader's score, and has a 45.27% Fully Incorrect rate.
  • Allocating more test-time compute via sampling increases the Fully Correct rate from 67.18% (n=1) to 85.71% (n=8).
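As a rough illustration of how per-prompt outcomes could roll up into the rates quoted above, here is a sketch under stated assumptions. It treats "Fully Correct" as an exact match between the predicted and gold sets and "Fully Incorrect" as zero overlap. Those definitions, the `"partial"` bucket, and the function names are assumptions made for illustration; the summary does not spell out the paper's exact criteria.

```python
def outcome_label(predicted: set[str], gold: set[str]) -> str:
    """Bucket one prompt's result (assumed definitions, not the paper's)."""
    if predicted == gold:
        return "fully_correct"    # every gold item found, nothing extra
    if not (predicted & gold):
        return "fully_incorrect"  # no gold item found at all
    return "partial"              # some overlap, but missing or extra items


def rate(labels: list[str], target: str) -> float:
    """Fraction of prompts that received the given label."""
    return labels.count(target) / len(labels) if labels else 0.0


# Hypothetical results for four prompts: (predicted set, gold set).
results = [
    ({"a", "b"}, {"a", "b"}),  # exact match
    ({"a"}, {"a", "b"}),       # under-retrieval
    ({"x"}, {"a", "b"}),       # nothing right
    ({"a", "b", "c"}, {"a", "b", "c"}),
]
labels = [outcome_label(pred, gold) for pred, gold in results]
fully_correct_rate = rate(labels, "fully_correct")
fully_incorrect_rate = rate(labels, "fully_incorrect")
```

Under these assumptions, Fully Correct is a much stricter bar than F1: a single missing or extra item moves a prompt out of that bucket even when its F1 stays high.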
Breakthrough Assessment
8/10
Significant shift in evaluation paradigm addressing the 'Comprehensiveness Gap'. The focus on set-based answers effectively isolates complex agentic behaviors like stopping criteria and de-duplication.