From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

📝 Paper Summary

Agentic RAG pipeline Self-evolving Agentic reasoning

The paper defines Agentic Deep Research as a new paradigm where LLMs use reinforcement learning and test-time scaling to autonomously plan, execute, and refine complex multi-step information searches.

Core Problem

Traditional keyword search and naive RAG systems fail to handle complex, multi-faceted queries because they cannot autonomously plan research paths or iteratively refine their understanding based on intermediate findings.

Why it matters:

Standard search engines overwhelm users with links for complex topics, requiring heavy manual synthesis
Naive RAG systems (single retrieval step) often retrieve irrelevant context or hallucinate when reasoning requires multiple hops
Current LLMs lack the strategic ability to recognize when initial search results are insufficient and automatically correct course

Concrete Example: For a question like 'What was the conference of the Vermont Catamounts men's soccer team formerly known as from 1988 to 1996?', a naive RAG system might run a single general search for the team and miss the specific historical timeframe. An Agentic Deep Research system would reason that it needs historical conference data, run a search, recognize that the conference name changed, and then search again for records from that specific era.

Key Novelty

Agentic Deep Research Paradigm & Test-Time Scaling (TTS) Law for Search

Proposes a paradigm shift where reasoning is not just post-processing but the driver of the search process, deciding when and what to query via a dynamic feedback loop
Formalizes a 'Test-Time Scaling Law for Deep Research', hypothesizing that increasing inference-time computation (reasoning depth) leads to better search outcomes and synthesis
Advocates for Reinforcement Learning (RL) over simple prompting to incentivize agents to explore and optimize search strategies autonomously

Architecture

Evolution of search paradigms from Web Search to LLMs as Chatbots, to LLMs with RAG, and finally to Agentic Deep Research.

Evaluation Highlights

OpenAI Deep Research agent achieves 51.5% on BrowseComp, significantly outperforming standard LLMs (typically <10%)
On the Humanity's Last Exam (HLE) benchmark, OpenAI Deep Research scores 26.6% compared to standard LLMs scoring under 20%
OpenAI Deep Research achieves 42.9% on BrowseComp-ZH (Chinese web search), demonstrating cross-lingual deep research capabilities

Breakthrough Assessment

9/10

This is a position paper that defines a major paradigm shift. While it relies on existing models (like OpenAI's) for results, its formalization of 'Agentic Deep Research' and the TTS law for search unifies fragmented efforts into a coherent field.

⚙️ Technical Details

Problem Definition

Setting: Open-ended complex information seeking and synthesis

Inputs: Complex, multi-faceted user query q

Outputs: Comprehensive, synthesized answer based on iterative retrieval and reasoning

Pipeline Flow

Reasoning/Planning Agent (analyzes query)
Action Execution (Search/Browse)
Information Synthesis & Verification
Iterative Feedback Loop (decide to stop or search more)
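The four stages above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `ResearchState`, `deep_research`, `toy_plan`, `toy_search`, and `toy_synthesize` are hypothetical, the planner and synthesizer are hand-written stubs standing in for LLM calls, and the index holds placeholder text ("Conference X", "Name Y") rather than real facts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ResearchState:
    query: str
    notes: List[str] = field(default_factory=list)     # evidence gathered so far
    searches: List[str] = field(default_factory=list)  # search queries issued so far


# Hypothetical interfaces: the planner returns the next search query,
# or None when it judges the gathered evidence sufficient.
Planner = Callable[[ResearchState], Optional[str]]
SearchTool = Callable[[str], List[str]]
Synthesizer = Callable[[ResearchState], str]


def deep_research(query: str, plan: Planner, search: SearchTool,
                  synthesize: Synthesizer, max_steps: int = 5) -> Tuple[str, ResearchState]:
    state = ResearchState(query=query)
    for _ in range(max_steps):
        next_query = plan(state)                 # reasoning / planning
        if next_query is None:                   # feedback loop: decide to stop
            break
        state.searches.append(next_query)
        state.notes.extend(search(next_query))   # action execution (search)
    return synthesize(state), state              # synthesis over collected notes


# Toy stand-ins so the loop can run without an LLM or a search engine.
TOY_INDEX = {
    "team current conference": ["The team plays in Conference X."],
    "Conference X former names": ["Conference X was called Name Y in an earlier era."],
}


def toy_search(q: str) -> List[str]:
    return TOY_INDEX.get(q, [])


def toy_plan(state: ResearchState) -> Optional[str]:
    if not state.searches:
        return "team current conference"
    # The second hop is only planned after the first result names the conference.
    if len(state.searches) == 1 and any("Conference X" in n for n in state.notes):
        return "Conference X former names"
    return None


def toy_synthesize(state: ResearchState) -> str:
    return " ".join(state.notes)


answer, state = deep_research("What was the team's conference formerly called?",
                              toy_plan, toy_search, toy_synthesize)
```

The second query depends on what the first search returned, which is the behavior a single retrieval pass cannot reproduce. In a real system the verification step would also check retrieved notes against the plan before synthesis; the stub above only concatenates them.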

System Modules

Reasoning Agent

Decomposes complex queries, plans search steps, and decides when to stop

Model or implementation: LLM optimized for reasoning (e.g., via RL)

Search/Browser Tool

Interacts with the environment (web or API) to fetch data

Model or implementation: External Search Engine or Browser API

Synthesizer/Critic

Evaluates retrieved information against the plan and synthesizes findings

Model or implementation: LLM

Novel Architectural Elements

Integration of Test-Time Scaling (TTS) into the search loop, allowing the model to 'think' longer (generate more reasoning tokens) specifically to plan better searches
Shift from 'Retrieve-then-Read' (sequential) to 'Iterative Synergy' where reasoning and search co-evolve dynamically
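A toy contrast between the two designs, with an inference-time step budget standing in for test-time compute, might look like the sketch below. Everything here is an assumption made for illustration: `CHAIN` is a made-up three-hop lookup table, and `retrieve_then_read` and `iterative_research` are hypothetical names. It shows only why a larger budget can help on multi-hop questions; it is not the paper's formalization of the scaling law.

```python
from typing import Dict, Optional

# Toy knowledge chain: each lookup reveals the key needed for the next lookup.
CHAIN: Dict[str, str] = {"q": "a", "a": "b", "b": "answer"}


def retrieve_then_read(question: str) -> Optional[str]:
    """Single retrieval followed by reading: succeeds only on one-hop questions."""
    hit = CHAIN.get(question)
    return hit if hit == "answer" else None


def iterative_research(question: str, step_budget: int) -> Optional[str]:
    """Interleave reasoning and search until the answer appears or the budget runs out."""
    current: Optional[str] = question
    for _ in range(step_budget):
        current = CHAIN.get(current)
        if current is None:        # dead end: nothing retrieved for this key
            return None
        if current == "answer":    # evidence is sufficient, stop early
            return current
    return None                    # budget exhausted before reaching the answer


single_pass = retrieve_then_read("q")
results = {budget: iterative_research("q", budget) for budget in (1, 2, 3, 4)}
```

In this toy setting the three-hop question is solved only once the budget reaches three steps, and extra budget beyond that changes nothing because the loop stops early.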

Modeling

Base Model: Varies by implementation (paper discusses OpenAI Deep Research, DeepSeek-R1, etc.)

Training Method: Reinforcement Learning (RL) incentivized search

Adaptation: RL-based optimization of reasoning and search policies

Trainable Parameters: Full model or specialized agent modules

Training Data:

Synthetic data generation
Instructional reformulation of existing datasets

Compute: Not reported in the paper
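The summary does not specify a reward design for the RL-incentivized search training, so the following is only one plausible shape for such a signal: an outcome reward on the final answer, minus a small capped cost per search call so that searching is rewarded only when it improves correctness. The function names `normalize` and `search_reward`, and the parameters `search_cost` and `max_penalty`, are hypothetical and not taken from the paper.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing answers."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def search_reward(predicted: str, gold: str, num_searches: int,
                  search_cost: float = 0.05, max_penalty: float = 0.5) -> float:
    """Illustrative episode reward: exact-match outcome minus a capped search cost."""
    correct = 1.0 if normalize(predicted) == normalize(gold) else 0.0
    penalty = min(search_cost * num_searches, max_penalty)
    return correct - penalty


good = search_reward("Name Y.", "name y", num_searches=2)
bad = search_reward("Something else", "name y", num_searches=2)
capped = search_reward("name y", "name y", num_searches=100)
```

A policy-gradient method would then raise the probability of reasoning and search trajectories that earn higher episode rewards; that optimization step is outside this sketch.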

Comparison to Prior Work

vs. Naive RAG: Deep Research is iterative and agentic, not a single linear pass
vs. ReAct: Deep Research emphasizes RL-driven optimization of the search policy rather than just prompting
vs. WebGPT: Deep Research incorporates Test-Time Scaling (TTS) to allow deeper reasoning during the search process
vs. STORM [not cited in paper]: STORM generates Wikipedia-like articles via multi-perspective QA; Deep Research focuses more broadly on the reasoning-search feedback loop and TTS

Limitations

High computational cost due to iterative search and reasoning (Test-Time Scaling)
Latency issues for real-time applications
Potential for error propagation if early reasoning steps are flawed
Dependence on the quality and availability of underlying search engines

Reproducibility

Code: https://github.com/DavidZWZ/Awesome-Deep-Research

The paper acts as a survey and position paper. It curates a list of resources at https://github.com/DavidZWZ/Awesome-Deep-Research. It does not release a specific new model but analyzes existing ones (OpenAI Deep Research, DeepSeek-R1, etc.).

📊 Experiments & Results

Evaluation Setup

Evaluation of agentic models on complex information-seeking benchmarks

Benchmarks:

BrowseComp (Multi-step open-ended web search)
BrowseComp-ZH (Chinese multi-step web search)
Humanity's Last Exam (HLE) (Expert-level academic/domain questions)

Metrics:

Success Rate / Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (standard LLMs, approx. %)	OpenAI Deep Research (%)	Δ (points)
BrowseComp	Success Rate	10.0	51.5	+41.5
BrowseComp-ZH	Success Rate	10.0	42.9	+32.9
Humanity's Last Exam (HLE)	Success Rate	20.0	26.6	+6.6
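The Δ column is plain subtraction in percentage points, which the short check below reproduces. The baseline values are the approximate figures from the table (elsewhere the summary gives them only as "under 10%" and "under 20%" for standard LLMs), so the gaps are best read as rough lower bounds rather than exact differences. The variable names are illustrative.

```python
# (baseline %, agentic system %) per benchmark, copied from the table above.
rows = {
    "BrowseComp": (10.0, 51.5),
    "BrowseComp-ZH": (10.0, 42.9),
    "Humanity's Last Exam (HLE)": (20.0, 26.6),
}

# Gap in percentage points, rounded to one decimal to match the table.
deltas = {name: round(score - base, 1) for name, (base, score) in rows.items()}

# Benchmark with the widest gap between the agentic system and the baseline.
largest = max(deltas, key=deltas.get)
```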

Experiment Figures

Bar chart comparing OpenAI Deep Research against standard LLMs on BrowseComp, BrowseComp-ZH, and HLE benchmarks.

Line chart showing the rapid growth of GitHub stars for open-source Deep Research projects (e.g., Deep-Searcher, Deer-Flow) in early 2025.

Main Takeaways

Agentic Deep Research systems significantly outperform standard LLMs on tasks requiring multi-step search and synthesis
The performance gap is largest on open-ended web browsing tasks (BrowseComp), confirming the value of agentic capabilities
Test-Time Scaling (TTS) is presented as a key performance multiplier: spending more inference-time computation on reasoning and search lets models solve problems that static parametric knowledge alone cannot
Open-source implementations (e.g., DeepResearcher, R1-Searcher) are rapidly gaining traction, validating the community shift toward this paradigm

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Reinforcement Learning (RL) concepts (rewards, policies)
Knowledge of Large Language Models (LLMs) and prompting strategies (CoT, ReAct)

Key Terms

RAG: Retrieval-Augmented Generation—systems that enhance LLM responses by fetching external data

Agentic Deep Research: A paradigm where AI agents autonomously plan, execute, and refine multi-step research tasks using reasoning and iterative search

Test-Time Scaling (TTS): Allocating more computational resources during inference (e.g., generating more reasoning steps) to improve performance

CoT: Chain-of-Thought—prompting LLMs to generate intermediate reasoning steps

RLHF: Reinforcement Learning from Human Feedback—training models to align with human preferences

ReAct: Reasoning + Acting—a prompting method where LLMs interleave reasoning traces with action execution

SFT: Supervised Fine-Tuning—training a model on labeled examples

BrowseComp: A benchmark evaluating an agent's ability to conduct multi-step open-ended web searches

HLE: Humanity's Last Exam—a benchmark with expert-level questions across diverse domains requiring deep synthesis

RL: Reinforcement Learning—a training method where agents learn optimal behaviors through trial and error and reward signals

Hallucination: When an LLM generates plausible but factually incorrect information

Multi-hop reasoning: Solving problems that require connecting pieces of information from multiple distinct sources or steps

DeepSeek-R1: A reasoning model that uses reinforcement learning to optimize reasoning chains