Deep Research: A Systematic Survey

📝 Paper Summary

Agentic RAG pipeline, Deep Research, Agentic Information Seeking, Full-stack AI Scientist

This survey formalizes Deep Research as an evolving paradigm where LLMs act as autonomous agents that plan queries, acquire evidence, manage memory, and synthesize comprehensive reports for open-ended tasks.

Core Problem

Standard RAG and single-shot prompting fail on open-ended tasks requiring critical thinking, multi-step verification, and long-horizon reasoning, as they lack autonomous workflows to decompose problems and manage extensive context.

Why it matters:

Real-world research tasks demand verifiable, self-contained outputs based on multi-source evidence, which simple retrieval augmentation cannot provide.
Existing surveys focus on static RAG or general web agents, missing the specific technical landscape of end-to-end research systems that synthesize long-form grounded reports.
Current LLMs suffer from hallucination and context loss when attempting complex, multi-step investigations without structured planning and memory management.

Concrete Example: When asked a complex question requiring cross-referencing multiple sources (e.g., a competitive market analysis), a standard RAG system might retrieve fragmented facts and hallucinate connections. A Deep Research system iteratively decomposes the query, browses live web pages, filters noise, and synthesizes a structured report with citations.

Key Novelty

Three-Stage Deep Research Roadmap & Component Taxonomy

Formalizes a three-phase evolution: from 'Agentic Search' (finding facts) to 'Integrated Research' (synthesizing reports) to 'Full-stack AI Scientist' (hypothesis generation and discovery).
Deconstructs the research workflow into four distinct, interactive components: Query Planning (decomposition), Information Acquisition (retrieval/tools), Memory Management (context maintenance), and Answer Generation (synthesis).

Architecture

An overview of the four key components in a general Deep Research system and their interaction loop.

Evaluation Highlights

Provides a structured taxonomy of 100+ representative systems and datasets across diverse research tasks (e.g., AutoSurvey, Search-R1, TheAIScientist).
Categorizes evaluation into four domains: Agentic Information Seeking, Comprehensive Report Generation, AI for Research (idea/experiment generation), and Software Engineering.
Identifies key optimization techniques: Workflow Prompting (e.g., DeepResearch), Supervised Fine-Tuning (e.g., WebThinker), and End-to-End Reinforcement Learning (e.g., Search-R1).

Breakthrough Assessment

9/10

This is a foundational survey that defines and structures the emerging field of Deep Research, providing clear taxonomies and a roadmap that distinguishes it from standard RAG.

⚙️ Technical Details

Problem Definition

Setting: Open-ended, long-horizon information seeking and synthesis tasks where truth is distributed across multiple sources.

Inputs: Complex research question or topic q

Outputs: Comprehensive, verifiable long-form report or answer Ans with explicit citations

Pipeline Flow

Query Planning (Decomposition/Reformulation)
Information Acquisition (Retrieval/Tool Use)
Memory Management (Update/Prune Context)
Answer Generation (Synthesis/Verification)
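
As a rough illustration of how these four stages might interact in a loop, here is a minimal Python sketch. The function names (`deep_research`, `toy_plan`, `toy_acquire`, `toy_update`, `toy_generate`), the toy corpus, and the stopping rule are illustrative assumptions, not an interface described by the survey.

```python
from typing import Callable, Dict, List

# Hypothetical toy corpus standing in for a search engine or retriever.
CORPUS: Dict[str, List[str]] = {
    "solar cost": ["Solar module prices fell."],
    "wind cost": ["Wind costs are stable."],
}


def deep_research(
    question: str,
    plan: Callable[[str, List[str]], List[str]],
    acquire: Callable[[str], List[str]],
    update_memory: Callable[[List[str], List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Run plan -> acquire -> update memory until no sub-queries remain."""
    memory: List[str] = []
    for _ in range(max_rounds):
        sub_queries = plan(question, memory)       # Query Planning
        if not sub_queries:                        # assumed stopping rule
            break
        evidence = [doc for q in sub_queries for doc in acquire(q)]  # Acquisition
        memory = update_memory(memory, evidence)   # Memory Management
    return generate(question, memory)              # Answer Generation


def toy_plan(question: str, memory: List[str]) -> List[str]:
    # Propose one sub-query whose evidence is not yet in memory.
    pending = [k for k in CORPUS if not any(d in memory for d in CORPUS[k])]
    return pending[:1]


def toy_acquire(query: str) -> List[str]:
    return CORPUS.get(query, [])


def toy_update(memory: List[str], evidence: List[str], limit: int = 5) -> List[str]:
    # Deduplicate, then keep only the most recent `limit` items.
    merged = memory + [e for e in evidence if e not in memory]
    return merged[-limit:]


def toy_generate(question: str, memory: List[str]) -> str:
    # Attach a numeric citation marker to every piece of evidence.
    cited = " ".join(f"{text} [{i + 1}]" for i, text in enumerate(memory))
    return f"{question}: {cited}"
```

Calling `deep_research("Compare costs", toy_plan, toy_acquire, toy_update, toy_generate)` runs two retrieval rounds and stops on the third, when the planner has nothing left to ask. Real systems would replace each toy function with an LLM call or a tool.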

System Modules

Query Planning

Decompose complex user inputs into executable sub-queries

Model or implementation: Varies (LLM-based planner)

Information Acquisition

Gather evidence from external sources using tools

Model or implementation: Search Engines (Google/Bing) or Neural Retrievers (ColBERT/SPLADE)
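
To make one of the neural retrievers named above more concrete, the sketch below shows ColBERT-style late interaction scoring: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The tiny hand-written vectors and the helper names (`cosine`, `maxsim_score`) are assumptions for illustration; a real system would obtain the embeddings from a trained encoder.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best match among document tokens."""
    if not doc_vecs:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Illustrative 2-d "token embeddings".
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0]]                          # covers only the first
```

Here `doc_a` scores higher than `doc_b` because every query token finds a close match in it, which is the behavior late interaction is meant to capture.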

Memory Management

Maintain relevant context over long horizons

Model or implementation: Vector Stores / Summarization Models
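
A minimal sketch of the update-and-prune behavior this module is responsible for, assuming each memory item carries a relevance score and that the lowest-scoring items are forgotten once a capacity is exceeded. The class names, the scoring scheme, and the capacity are hypothetical; the systems surveyed may instead use vector stores or summarization.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryItem:
    text: str
    score: float  # hypothetical relevance score assigned by the caller


@dataclass
class BoundedMemory:
    capacity: int = 3
    items: List[MemoryItem] = field(default_factory=list)

    def update(self, text: str, score: float) -> None:
        """Insert new evidence, or raise the score of evidence already stored."""
        for item in self.items:
            if item.text == text:
                item.score = max(item.score, score)
                break
        else:
            self.items.append(MemoryItem(text, score))
        self.prune()

    def prune(self) -> None:
        """Forget the lowest-scoring items beyond the capacity."""
        self.items.sort(key=lambda it: it.score, reverse=True)
        del self.items[self.capacity:]

    def context(self) -> str:
        """Render the retained evidence as prompt context."""
        return "\n".join(it.text for it in self.items)
```

Score-based forgetting is only one possible policy; recency-based or summary-based pruning would fit the same interface.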

Answer Generation

Synthesize accumulated evidence into final output

Model or implementation: LLM Generator

Novel Architectural Elements

Three-stage roadmap formalization: Agentic Search → Integrated Research → Full-stack AI Scientist
Hierarchical decomposition of Deep Research into four interacting components with specific sub-taxonomies (e.g., Retrieval Timing within Acquisition, Forgetting within Memory)

Modeling

Base Model: None specific; the survey reviews multiple models and systems (e.g., GPT-4, Llama-3, Search-R1)

Training Method: Review of methods: Workflow Prompting, Supervised Fine-Tuning (SFT), End-to-End Reinforcement Learning

Objective Functions:

Purpose: Optimize query planning.

Formally: Reward modeling based on retrieval utility (e.g., NDCG, recall) or final answer correctness.

Purpose: Align generation with evidence.

Formally: Preference optimization (e.g., DPO) or policy optimization (e.g., PPO) using citations or factual consistency as signals.
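
To illustrate the first objective, the sketch below computes recall@k and NDCG@k with binary relevance and blends NDCG with answer correctness into a scalar reward. The blending weight `alpha`, the function name `planning_reward`, and the assumption that ranked document IDs are unique are illustrative choices, not a formula taken from the survey.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains; assumes `ranked` has no duplicate IDs."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def planning_reward(
    ranked: Sequence[str],
    relevant: Set[str],
    answer_correct: bool,
    k: int = 5,
    alpha: float = 0.5,  # hypothetical weight between retrieval and answer
) -> float:
    """Blend retrieval utility with final-answer correctness."""
    return alpha * ndcg_at_k(ranked, relevant, k) + (1 - alpha) * float(answer_correct)
```

With `alpha=1.0` the reward depends only on retrieval quality, and with `alpha=0.0` only on the final answer, which mirrors the two signal types named above.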

Adaptation: Varies (full fine-tuning or LoRA, depending on the system reviewed)

Trainable Parameters: Varies by system

Training Data:

Synthetic data generation (e.g., distillation from strong models)
Rejection sampling on reasoning paths
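
Rejection sampling on reasoning paths can be sketched as follows: sample several candidate trajectories per question and keep only those whose final answer matches a reference. The sampler signature, the case-insensitive match, and the record layout are assumptions made for this example; actual systems may filter with a verifier or a judge model instead.

```python
import random
from typing import Callable, Dict, List, Tuple

# A sampler returns (reasoning steps, final answer) for a question.
Sampler = Callable[[str, random.Random], Tuple[List[str], str]]


def rejection_sample(
    question: str,
    gold_answer: str,
    sample_path: Sampler,
    n_samples: int = 8,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Keep only sampled paths whose final answer matches the reference."""
    rng = random.Random(seed)
    kept: List[Dict[str, object]] = []
    for _ in range(n_samples):
        steps, answer = sample_path(question, rng)
        if answer.strip().lower() == gold_answer.strip().lower():
            kept.append({"question": question, "reasoning": steps, "answer": answer})
    return kept


def toy_sampler(question: str, rng: random.Random) -> Tuple[List[str], str]:
    # Stand-in for an LLM: guesses one of two answers at random.
    answer = rng.choice(["Paris", "Lyon"])
    return [f"search: {question}", f"conclude: {answer}"], answer
```

The retained records could then serve as supervised fine-tuning examples, since every kept path ends in a correct answer.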

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RAG: DR employs autonomous planning, iterative execution, and long-horizon memory management, whereas RAG is typically a fixed retrieve-then-generate pipeline.
vs. Web Agents: DR emphasizes 'research' outputs (synthesis, verification, report generation) over general task completion (e.g., booking flights).

Limitations

Retrieval Timing remains a challenge; deciding exactly when to stop searching is difficult.
Memory Evolution is complex; current systems struggle with efficient updating and forgetting mechanisms over very long contexts.
Training algorithms are unstable; RL for multi-step reasoning is prone to noise and credit assignment issues.
Evaluation is difficult; judging 'novelty' or 'insight' in Phase III systems is subjective and hard to automate.

Reproducibility

Code: https://github.com/mangopy/Deep-Research-Survey

The paper is a survey and does not propose a single new model. However, it provides a curated list of papers and a GitHub repository (https://github.com/mangopy/Deep-Research-Survey) tracking the field.

📊 Experiments & Results

Evaluation Setup

Survey of evaluation methodologies across four domains.

Benchmarks:

GAIA (General AI Assistants benchmark)
GPQA (Graduate-Level Google-Proof Q&A)
DeepResearch Bench (Long-form report generation evaluation)
TheAIScientist (Automated scientific discovery)

Metrics:

Success Rate
Factuality / Citation Accuracy
Coherence
Novelty (for Phase III)

Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Three commonly used query planning strategies: Parallel, Sequential, and Tree-based.
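
The three strategies named in the caption above can be contrasted in a short sketch. Parallel planning issues all sub-queries at once, sequential planning chooses each sub-query after seeing earlier results, and tree-based planning expands sub-queries recursively. All function names and the toy decomposition table are hypothetical; in practice an LLM would play the role of `decompose` and `next_query`.

```python
from typing import Callable, Dict, List, Optional, Tuple

TOY_SPLITS: Dict[str, List[str]] = {
    "EV market": ["EV sales", "EV batteries"],
    "EV batteries": ["battery cost"],
}


def toy_decompose(query: str) -> List[str]:
    return TOY_SPLITS.get(query, [])


def toy_answer(query: str) -> str:
    return f"notes on {query}"


def toy_next_query(question: str, history: List[Tuple[str, str]]) -> Optional[str]:
    # Pretend each step is chosen after reading the previous result.
    steps = ["EV sales", "EV sales by region"]
    return steps[len(history)] if len(history) < len(steps) else None


def parallel_plan(question: str, decompose=toy_decompose, answer=toy_answer) -> Dict[str, str]:
    """Independent sub-queries, all generated up front."""
    return {q: answer(q) for q in decompose(question)}


def sequential_plan(
    question: str,
    next_query: Callable[[str, List[Tuple[str, str]]], Optional[str]] = toy_next_query,
    answer=toy_answer,
    max_steps: int = 3,
) -> List[Tuple[str, str]]:
    """Each sub-query may depend on the answers gathered so far."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        query = next_query(question, history)
        if query is None:
            break
        history.append((query, answer(query)))
    return history


def tree_plan(question: str, decompose=toy_decompose, depth: int = 2) -> dict:
    """Recursively expand sub-queries into a tree of bounded depth."""
    if depth == 0:
        return {"query": question, "children": []}
    children = [tree_plan(c, decompose, depth - 1) for c in decompose(question)]
    return {"query": question, "children": children}
```

The trade-off is roughly latency versus adaptivity: parallel plans can run concurrently, while sequential and tree-based plans can react to what earlier searches returned.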

Main Takeaways

Deep Research systems are evolving from simple search agents (Phase I) to integrated research assistants (Phase II) and eventually autonomous scientists (Phase III).
Key technical challenges are shifting from simple retrieval accuracy to complex planning, memory management, and verifiable long-form synthesis.
Evaluation is moving from exact-match metrics (EM/F1) to holistic assessments of report quality, factuality, and agentic trajectory success.
Future directions include proactive memory evolution, stable RL training for long reasoning chains, and better automated evaluation for open-ended research tasks.
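
For reference, the exact-match metrics (EM/F1) mentioned in the takeaways above are typically computed along the following lines. This is a sketch of the common SQuAD-style convention (lowercasing, stripping punctuation and English articles); the survey does not prescribe a specific normalization, so treat those details as assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction such as "Eiffel Tower in Paris" against the reference "Eiffel Tower" gets an EM of 0 despite being correct, which hints at why these metrics fit long-form reports poorly.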

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) fundamentals
Large Language Model (LLM) agent architectures
Reinforcement Learning (RL) basics (PPO, GRPO)

Key Terms

Deep Research (DR): An agentic workflow where LLMs autonomously plan, retrieve, reason, and synthesize information to solve complex, open-ended problems beyond simple Q&A.

RAG: Retrieval-Augmented Generation—AI systems that answer questions by retrieving documents from a static corpus before generating a response.

MCTS: Monte Carlo Tree Search—a heuristic search algorithm used for decision processes, here applied to exploring reasoning paths in query planning.

SFT: Supervised Fine-Tuning—training a model on labeled examples to adapt it to specific tasks like tool use or query decomposition.

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used to optimize policies based on group-wise comparisons of outputs.

PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm for training agents.

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer.

SPLADE: Sparse Lexical and Expansion Model—a neural retrieval method that learns sparse representations for efficient keyword-based search.

ColBERT: Contextualized Late Interaction over BERT—a dense retrieval model that matches token-level embeddings.

LayoutLM: A document understanding model that incorporates text layout and visual information.

Multimodal Retrieval: Retrieval systems that index and search across text, images, charts, and tables simultaneously.
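
Among the terms above, GRPO's group-wise comparison can be made concrete with a small sketch: the rewards of several outputs sampled for the same prompt are normalized against the group mean and standard deviation to give per-output advantages. The use of the population standard deviation and the `eps` constant are assumptions (implementations differ on such details), and the clipped policy update that would consume these advantages is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the statistics of its own group."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that beat their group's average receive positive advantages and the rest negative ones, so no separately learned value function is needed to form the baseline.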