WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

📝 Paper Summary

Open-Ended Deep Research (OEDR) Multi-Agent Systems

WebWeaver is a dual-agent framework that decouples research into an iterative planner that co-evolves outlines with search, and a hierarchical writer that synthesizes reports section-by-section using targeted memory retrieval.

Core Problem

Current AI research agents use static pipelines that fail to adapt plans based on new findings, or monolithic generation methods that suffer from 'lost-in-the-middle' phenomena and citation hallucinations.

Why it matters:

Proprietary solutions are prohibitively expensive and restrictive, hindering academic research.
Static 'search-then-generate' open-source methods lack coherence and produce low-quality reports.
Feeding all gathered materials (100+ pages) into a single context window causes severe hallucinations and poor citation accuracy due to attentional saturation.

Concrete Example: On DeepResearch Bench, most proprietary agents score poorly on citation accuracy (the FACT metrics). Standard approaches either generate an outline before searching (missing emergent information) or search before outlining (constraining scope). When writing, feeding 100k+ tokens of raw evidence into a single context causes the model to overlook crucial details or hallucinate citations.

Key Novelty

Dual-Agent Human-Centric Research Loop

**Planner:** Uses a dynamic cycle where searching and outlining co-evolve. Emergent search results reshape the outline, and the refined outline guides subsequent searches, mirroring human research.
**Writer:** Abandons monolithic generation for a hierarchical, section-by-section approach. It retrieves only relevant evidence from a structured memory bank for specific sections to prevent context overflow.

Architecture

The dual-agent workflow comprising the Planner and Writer agents.

Evaluation Highlights

State-of-the-art on DeepResearch Bench, achieving 93.37% citation accuracy (C. Acc.) and a 52.88 overall score, outperforming OpenAI DeepResearch and Gemini-2.5-pro.
Achieves the highest win rate (66.86%) on the DeepConsult benchmark against the OpenAI DeepResearch baseline.
Demonstrates effective agentic finetuning: a 30B model (Qwen3-30b-a3b-instruct-2507) finetuned on WebWeaver-3k data improves citation accuracy from ~25% to 85.90%.

Breakthrough Assessment

9/10

Outperforms top proprietary systems (OpenAI, Gemini) on citation accuracy and report quality in the reported benchmarks. The dual-agent architecture substantially mitigates context-window saturation for long reports.

⚙️ Technical Details

Problem Definition

Setting: Open-Ended Deep Research (OEDR) without ground-truth answers.

Inputs: An open-ended research question.

Outputs: A comprehensive report with accurate citations.

Pipeline Flow

Planner: Receive Question → Iterative Loop (Search ↔ Outline Optimization) → Final Outline
Writer: Receive Outline → Hierarchical Loop (Retrieve Section Evidence → Write Section → Prune Context) → Final Report
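The Planner line of the flow above can be sketched as a loop in which each round of search feeds an outline refinement, and the refined outline proposes the next queries. This is a minimal illustration only: `run_planner`, `PlannerState`, the three callables, and the `max_rounds` stopping rule are hypothetical names and assumptions, not the paper's implementation, and the stub functions stand in for real search and LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PlannerState:
    """Hypothetical container for what the Planner accumulates."""
    question: str
    outline: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


def run_planner(
    question: str,
    search: Callable[[str], List[str]],
    refine_outline: Callable[[List[str], List[str]], List[str]],
    next_queries: Callable[[List[str], List[str]], List[str]],
    max_rounds: int = 3,  # assumed stopping rule, not taken from the paper
) -> PlannerState:
    state = PlannerState(question=question)
    queries = [question]
    for _ in range(max_rounds):
        if not queries:  # nothing left to look up: the outline is final
            break
        for q in queries:
            state.evidence.extend(search(q))
        # Search results reshape the outline ...
        state.outline = refine_outline(state.outline, state.evidence)
        # ... and the refined outline guides the next round of search.
        queries = next_queries(state.outline, state.evidence)
    return state


# Stub tools so the sketch runs without a search engine or an LLM.
def fake_search(query: str) -> List[str]:
    return [f"evidence for {query}"]


def fake_refine(outline: List[str], evidence: List[str]) -> List[str]:
    return [f"Section {i + 1}" for i in range(len(evidence))]


def fake_next_queries(outline: List[str], evidence: List[str]) -> List[str]:
    return [] if len(outline) >= 2 else [f"follow-up {len(evidence)}"]


state = run_planner("example question", fake_search, fake_refine, fake_next_queries)
print(state.outline)
```

Passing the tools in as callables keeps the control flow visible and lets the loop be exercised with stubs; a real planner would replace them with model and search calls.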

System Modules

Planner

Executes a dynamic research cycle that interleaves evidence acquisition with outline optimization.

Model or implementation: Claude-sonnet-4-20250514 (default), Qwen3-30b-a3b-instruct-2507 (finetuned)

Memory Bank

Stores raw evidence (quotes, data) and summaries extracted from web pages.

Model or implementation: N/A (Storage structure)
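One way to picture the memory bank is as a keyed store of evidence records that can be fetched by ID. The sketch below is an assumption-laden illustration: the `EvidenceItem` fields, the `ev1`, `ev2`, ... ID scheme, and the `MemoryBank` API are hypothetical and are not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class EvidenceItem:
    """Hypothetical record: one web page's summary plus its raw quotes."""
    evidence_id: str
    url: str
    summary: str
    quotes: Tuple[str, ...]


class MemoryBank:
    """Hypothetical in-memory store keyed by evidence ID."""

    def __init__(self) -> None:
        self._items: Dict[str, EvidenceItem] = {}

    def add(self, url: str, summary: str, quotes: Iterable[str]) -> str:
        # Assumed ID scheme: sequential "ev1", "ev2", ...
        evidence_id = f"ev{len(self._items) + 1}"
        self._items[evidence_id] = EvidenceItem(
            evidence_id, url, summary, tuple(quotes)
        )
        return evidence_id

    def get(self, ids: Iterable[str]) -> List[EvidenceItem]:
        # Unknown IDs are skipped rather than raising an error.
        return [self._items[i] for i in ids if i in self._items]


bank = MemoryBank()
first_id = bank.add(
    "https://example.com/a", "Summary of page A", ["quote one", "quote two"]
)
second_id = bank.add("https://example.com/b", "Summary of page B", ["quote three"])
fetched = bank.get([first_id, "ev999"])
print([item.url for item in fetched])
```

Keeping raw quotes next to the summary means a writer can cite the exact wording without reloading the source page.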

Writer

Constructs report section by section using targeted retrieval.

Model or implementation: Claude-sonnet-4-20250514 (default)

Novel Architectural Elements

Co-evolutionary Planning Loop: Outline is not static; it is optimized iteratively based on search results, which in turn guide further search.
Dynamic Retrieval-and-Pruning: The Writer clears source materials from context after finishing a section to maintain coherence and reduce hallucination.
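The retrieval-and-pruning idea can be sketched as a per-section loop: load only that section's evidence, write the section, then drop the evidence before moving on. Everything named here (`write_report`, `evidence_for`, `write_section`, the `peak` counter) is hypothetical and only illustrates the control flow; it is not the authors' code.

```python
from typing import Callable, Dict, List, Tuple


def write_report(
    outline: List[str],
    evidence_for: Callable[[str], List[str]],
    write_section: Callable[[str, List[str], List[str]], str],
) -> Tuple[str, int]:
    """Write section by section; return the report and the peak evidence count."""
    finished: List[str] = []
    peak = 0
    for title in outline:
        # Retrieve only the evidence relevant to this section.
        context = list(evidence_for(title))
        peak = max(peak, len(context))
        # The writer sees this section's evidence plus the already written text.
        finished.append(write_section(title, context, finished))
        # Prune: source material is discarded before the next section starts.
        context.clear()
    return "\n\n".join(finished), peak


# Stub data and a stub writer so the sketch runs without an LLM.
evidence: Dict[str, List[str]] = {
    "Intro": ["source a", "source b"],
    "Methods": ["source c"],
}


def stub_writer(title: str, sources: List[str], done: List[str]) -> str:
    return f"{title}: {len(sources)} sources"


report, peak = write_report(["Intro", "Methods"], lambda t: evidence[t], stub_writer)
print(report)
print(peak)
```

In this sketch the working context never holds more than one section's evidence, which is the property the pruning step is meant to guarantee.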

Modeling

Base Model: Claude-sonnet-4-20250514 (main agent), Qwen3-30b-a3b-instruct-2507 (for SFT experiments)

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full parameter fine-tuning

Training Data:

WebWeaver-3k dataset
3.3k planning trajectories
3.1k writing trajectories
Generated by a teacher model running the WebWeaver framework, then filtered for high fidelity

Key Hyperparameters:

learning_rate: 7e-6
iterations: 1000

Compute: 16 NVIDIA H20 GPUs

Comparison to Prior Work

vs. OpenAI DeepResearch: WebWeaver uses a transparent dual-agent loop and achieves higher citation accuracy.
vs. Storm [not cited in paper]: Storm typically fixes the outline early; WebWeaver co-evolves the outline and search iteratively.
vs. LongWriter: WebWeaver uses hierarchical retrieval-and-pruning rather than feeding all context at once, reducing 'lost-in-the-middle' errors.

Limitations

Heavy reliance on the quality of the underlying LLM for the Planner's reasoning.
Latency may be high due to the iterative nature of the search and sequential writing process.
Cost of multiple LLM calls for outline optimization and per-section writing could be significant.

Reproducibility

No code release is mentioned. The WebWeaver-3k dataset is described, but the text gives no download URL. Finetuning details (learning rate, iterations, GPUs) are provided.

📊 Experiments & Results

Evaluation Setup

Open-ended research report generation evaluated by LLM judges.

Benchmarks:

DeepResearch Bench: 100 PhD-level complex research tasks
DeepConsult: Business and consulting research prompts
DeepResearchGym: 100 real-world information-seeking queries

Metrics:

RACE (Report Quality: Comprehensiveness, Insight, Instruction-Following, Readability)
FACT (Citation Accuracy, Effective Citations)
Win Rate (vs. OpenAI DeepResearch)
Average Quality Score
Statistical methodology: Not explicitly reported in the paper
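As a rough illustration of the two FACT quantities, the sketch below assumes that each cited statement in a report has already been judged as supported or unsupported by its source. The function name, the input shape, and the pooled aggregation are assumptions made for illustration; the benchmark's exact procedure may differ (for example, it may average per task).

```python
from typing import List, Tuple


def fact_metrics(judgments_per_report: List[List[bool]]) -> Tuple[float, float]:
    """Return (citation accuracy in %, average effective citations per report).

    Each inner list holds one bool per cited statement: True if the cited
    source supports the statement, False otherwise.
    """
    total = sum(len(j) for j in judgments_per_report)
    supported = sum(sum(j) for j in judgments_per_report)
    # Share of all citations that are supported by their source.
    citation_accuracy = 100.0 * supported / total if total else 0.0
    # Supported citations per report.
    n_reports = len(judgments_per_report)
    effective_citations = supported / n_reports if n_reports else 0.0
    return citation_accuracy, effective_citations


accuracy, effective = fact_metrics([[True, True, False], [True]])
print(accuracy, effective)
```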

Key Results

WebWeaver outperforms all proprietary and open-source baselines on DeepResearch Bench, particularly in citation accuracy.

Benchmark	Metric	Baseline	This Paper	Δ
DeepResearch Bench	Overall Score	48.24	52.88	+4.64
DeepResearch Bench	Citation Accuracy (C. Acc.)	82.52	93.37	+10.85
DeepResearch Bench	Comprehensiveness	46.12	50.82	+4.70
DeepConsult	Win Rate	50.00	66.86	+16.86
DeepResearchGym	Average Score	92.03	96.77	+4.74

Ablation studies demonstrate the critical role of the hierarchical writer over brute-force generation.

Benchmark	Metric	Baseline	This Paper	Δ
DeepResearch Bench	Citation Accuracy	86.73	93.37	+6.64
DeepResearch Bench	Insight	42.72	50.02	+7.30

Finetuning a smaller model (Qwen3-30B) on WebWeaver-3k data yields large improvements, approaching expert performance.

Benchmark	Metric	Baseline	This Paper	Δ
DeepResearch Bench	Citation Accuracy	25.00	85.90	+60.90
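The Δ column is the reported score minus the baseline score. The short check below recomputes it from the figures listed in the results rows; the `rows` list and variable names are illustrative, and the values are copied from those rows rather than taken from any other source.

```python
# (benchmark, metric, baseline, this paper), copied from the results rows.
rows = [
    ("DeepResearch Bench", "Overall Score", 48.24, 52.88),
    ("DeepResearch Bench", "Citation Accuracy (C. Acc.)", 82.52, 93.37),
    ("DeepResearch Bench", "Comprehensiveness", 46.12, 50.82),
    ("DeepConsult", "Win Rate", 50.00, 66.86),
    ("DeepResearchGym", "Average Score", 92.03, 96.77),
    ("DeepResearch Bench", "Citation Accuracy (ablation)", 86.73, 93.37),
    ("DeepResearch Bench", "Insight (ablation)", 42.72, 50.02),
    ("DeepResearch Bench", "Citation Accuracy (finetuned 30B)", 25.00, 85.90),
]

# Round to two decimals to hide floating-point noise in the subtraction.
deltas = [round(paper - baseline, 2) for _, _, baseline, paper in rows]

for (benchmark, metric, _, _), delta in zip(rows, deltas):
    print(f"{benchmark} | {metric} | {delta:+.2f}")
```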

Experiment Figures

Impact of outline optimization rounds on DeepResearch Bench scores.

Context token usage comparison between Hierarchical Writer and Brute-force Writer.

Main Takeaways

Iterative Outline Optimization works: Later-round outlines achieve near-perfect depth/breadth scores, which supports co-evolving search and planning over static planning.
Hierarchical Writing solves context limits: By processing one section at a time with only relevant evidence, the writer avoids 'lost-in-the-middle' issues and drastically improves citation accuracy.
Agentic skills are learnable: A 30B model finetuned on trajectories generated by the framework (WebWeaver-3k) recovered expert-level behaviors, particularly in tool use and citation grounding.

📚 Prerequisite Knowledge

Prerequisites

Agentic workflows (ReAct)
Retrieval-Augmented Generation (RAG)
Context window limitations (Lost-in-the-middle phenomenon)

Key Terms

OEDR: Open-Ended Deep Research—complex challenges where agents synthesize vast web-scale information into reports.

Planner: An agent role responsible for iteratively searching the web and optimizing the report outline based on findings.

Writer: An agent role responsible for retrieving specific evidence for each outline section and generating text.

RACE: A metric suite for Report Quality assessing Comprehensiveness, Insight, Instruction-Following, and Readability.

FACT: A metric suite for Web Retrieval assessing Citation Accuracy and Average Effective Citations.

Lost-in-the-middle: A phenomenon where LLMs fail to retrieve or use information located in the middle of a long input context.

SFT: Supervised Fine-Tuning—training a model on a labeled dataset to improve specific capabilities.

Contextual Bleeding: When information from one section of a text generation task incorrectly influences or merges with the synthesis of another section.