Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

📝 Paper Summary

Agentic AI, Web Agents

ASearcher unlocks expert-level search intelligence by enabling extremely long-horizon tool use (100+ turns) via fully asynchronous RL and high-quality synthetic data generation.

Core Problem

Existing online RL for search agents is limited by small turn limits (≤10), preventing complex strategy learning, and suffers from a lack of high-quality, challenging training data.

Why it matters:

Complex real-world queries often require resolving conflicting information and deep exploration beyond just a few search steps
Batch generation RL systems are inefficient with long trajectories due to blocking, causing GPU idle time
Open-source agents lag behind proprietary models in 'Search Intelligence'—the ability to resolve ambiguity and verify conclusions

Concrete Example: For a GAIA question about medals won by China in the 2012 Olympics, standard agents fail due to conflicting online reports (38 vs 39 gold medals). They cannot perform the deep verification needed to identify doping disqualifications as the root cause of the conflict.

Key Novelty

ASearcher: Fully Asynchronous Agentic RL

Decouples trajectory execution from model updates, allowing extremely long horizons (e.g., 128 turns) without the 'straggler problem' where one long trajectory blocks the whole batch
Uses a self-evolving data synthesis agent that iteratively 'fuzzes' queries (adds ambiguity) and 'injects' facts to create challenging, grounded training data from seed questions
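The decoupling described in the first point can be illustrated with a toy simulation. This is a minimal sketch using Python's `asyncio`, not ASearcher's actual system: the names `rollout`, `trainer`, and the `policy` dictionary are hypothetical, the sleeps stand in for generation and tool latency, and the version counter stands in for a gradient step.

```python
import asyncio

async def rollout(task_id, num_turns, queue, policy):
    """Simulate one trajectory; every turn yields control like a real tool call would."""
    start_version = policy["version"]  # policy version when the trajectory began
    for _ in range(num_turns):
        await asyncio.sleep(0.001)  # stand-in for LLM generation + tool latency
    await queue.put({"task": task_id, "turns": num_turns, "start_version": start_version})

async def trainer(queue, policy, batch_size, num_batches):
    """Run an update as soon as batch_size trajectories are ready, whichever they are."""
    batches = []
    for _ in range(num_batches):
        batch = [await queue.get() for _ in range(batch_size)]
        policy["version"] += 1  # stand-in for a gradient step
        batches.append(batch)
    return batches

async def main():
    policy = {"version": 0}
    queue = asyncio.Queue()
    turn_counts = [2, 3, 2, 100, 4, 3, 2]  # one 100-turn straggler among short rollouts
    workers = [asyncio.create_task(rollout(i, n, queue, policy))
               for i, n in enumerate(turn_counts)]
    batches = await trainer(queue, policy, batch_size=2, num_batches=3)
    await asyncio.gather(*workers)  # the straggler finishes after all three updates
    leftover = queue.get_nowait()
    return batches, leftover, policy["version"]

batches, leftover, final_version = asyncio.run(main())
```

In this toy run the three updates are formed entirely from short trajectories, while the 100-turn trajectory completes later and still carries the policy version it started with. A real system has to decide how to handle such stale trajectories; that policy is not modeled here.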

Architecture

Figure 2 shows the agent loop (Search/Browser tools); Figure 7 shows the Fully Asynchronous RL System vs. One-step-off RL

Evaluation Highlights

+78.0% relative Avg@4 improvement on xBench-DeepSearch (28.7 to 51.1) and +34.3% on GAIA (43.7 to 58.7) for the QwQ-32B-based agent after RL training
Achieves Avg@4 score of 58.7 on GAIA and 51.1 on xBench-DeepSearch, surpassing existing open-source 32B agents
Demonstrates extreme long-horizon capabilities with tool calls exceeding 100 turns and output tokens exceeding 400k during training

Breakthrough Assessment

9/10

Significantly pushes the boundary of open-source agentic search by successfully training effective 100+ turn trajectories, addressing both the system efficiency bottleneck and the data scarcity problem.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) where state includes history/search results, and action includes generated tokens/tool calls

Inputs: Natural language query x

Outputs: Final answer after multi-turn tool interaction

Pipeline Flow

Data Synthesis Agent (Offline) -> Training Data
Search Agent (Inference/Training Loop): Query -> [Model generates Action] -> [Tool Execution] -> [Observation appended] -> Repeat -> Final Answer
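The search-agent loop above can be sketched as follows. This is an illustrative outline only: `run_agent`, `scripted_policy`, the stub tools, and the example URL are hypothetical stand-ins, and a real agent would call an LLM and live search and browser tools instead.

```python
def run_agent(query, policy, tools, turn_limit=128):
    """Query -> action -> tool execution -> observation appended -> repeat."""
    history = [("query", query)]
    for _ in range(turn_limit):
        kind, arg = policy(history)          # model generates an action
        history.append((kind, arg))
        if kind == "answer":                 # a final answer ends the episode
            return arg, history
        observation = tools[kind](arg)       # tool execution
        history.append(("observation", observation))
    return None, history                     # turn limit reached without an answer

def scripted_policy(history):
    """Toy policy: search once, browse once, then answer with the last observation."""
    observations = [item for tag, item in history if tag == "observation"]
    if len(observations) == 0:
        return ("search", "toy query")
    if len(observations) == 1:
        return ("browse", "https://example.com/page")
    return ("answer", observations[-1])

# Stub tools; the returned strings are made up for the example.
tools = {
    "search": lambda q: "snippet: see https://example.com/page",
    "browse": lambda url: "full page text",
}

answer, history = run_agent("toy question", scripted_policy, tools)
truncated_answer, truncated_history = run_agent("toy question", scripted_policy, tools, turn_limit=1)
```

The second call shows why the turn limit matters: with a single allowed turn, the toy agent never reaches the point where it can answer.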

System Modules

Search Agent (Base LLM)

Generate reasoning, tool calls, and final answers

Model or implementation: Qwen2.5-7B/14B or QwQ-32B

Search Engine (Tools)

Retrieve web snippets and URLs

Model or implementation: External API (Search Engine)

Web Browser (Tools)

Retrieve full webpage content

Model or implementation: Headless Browser

Novel Architectural Elements

Fully asynchronous rollout-update decoupling: Trajectory execution is completely independent of model updates, allowing trajectories of vastly different lengths (10 vs 100 turns) to coexist without blocking batch formation
LRM-specific history compression: Discards intermediate thinking tokens but keeps summarized thoughts and tool calls to manage context length for long-horizon reasoning
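The history compression idea can be illustrated with a small sketch. The per-turn record layout (`thinking`, `summary`, `tool_call`) and the function names are assumptions made for this example; the summary above does not specify the exact format, nor how tool observations are treated, so they are left out here.

```python
def compress_history(turns, keep_fields=("summary", "tool_call")):
    """Drop everything except the fields listed in keep_fields from each turn."""
    return [{k: t[k] for k in keep_fields if k in t} for t in turns]

def context_chars(turns):
    """Rough context size: total characters across all stored fields."""
    return sum(len(str(v)) for t in turns for v in t.values())

# Hypothetical turn records; "thinking" is deliberately long.
turns = [
    {
        "thinking": "Let me consider several interpretations of the question... " * 20,
        "summary": "Need the official medal table.",
        "tool_call": "search('2012 Olympics medal table')",
    },
    {
        "thinking": "The two sources disagree, so I should check for later revisions... " * 20,
        "summary": "Sources conflict; verify with a primary source.",
        "tool_call": "browse('https://example.com/medals')",
    },
]

compressed = compress_history(turns)
saved = context_chars(turns) - context_chars(compressed)
```

Because the thinking tokens dominate each turn, removing them keeps the context roughly proportional to the number of turns rather than to the total amount of reasoning produced.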

Modeling

Base Model: Qwen2.5-7B/14B and QwQ-32B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize policy based on relative rewards within a group of trajectories.

Formally: Minimize L_GRPO = -E [ (pi(a|s)/pi_old(a|s)) * A_hat ] + beta * KL_divergence
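A small numeric sketch may help make the objective concrete. It assumes the commonly used GRPO advantage, where each trajectory's reward is standardized against the mean and standard deviation of its group; whether ASearcher uses exactly this normalization is not stated here. The function names, the `eps` constant, and the scalar `kl` input are illustrative.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (assumed normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def surrogate_loss(logp_new, logp_old, advantages, kl, beta):
    """-E[(pi / pi_old) * A_hat] + beta * KL, with the expectation as a sample mean."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    weighted = [r * a for r, a in zip(ratios, advantages)]
    return -mean(weighted) + beta * kl

# Four trajectories for the same query: two correct (reward 1), two wrong (reward 0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# With an unchanged policy every ratio is 1, so only the KL term contributes.
loss = surrogate_loss([0.0] * 4, [0.0] * 4, advantages, kl=0.5, beta=0.1)
```

Note that a group in which every trajectory receives the same reward yields zero advantages, so such a group contributes no policy-gradient signal under this formulation.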

Training Data:

Filtered HotpotQA/2WikiMultiHopQA (16k samples)
WebWalkerQA subset
Synthesized Data: 134k samples generated from 14k seeds via iterative fuzzing and injection

Key Hyperparameters:

turn_limit: 128 (during training)
group_size_G: Not reported in the paper
beta_KL: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: Enables 100+ turns vs. ≤10 turns; fully asynchronous vs. synchronous/semi-asynchronous batching
vs. Search-o1: Validates conclusions and resolves conflicts via RL training vs. prompt-engineering only
vs. Offline RL [not cited in paper]: Interacts with live environment during training to learn dynamic error correction

Limitations

Dependency on external search engine APIs and internet availability
Computational cost of extremely long trajectories (400k+ tokens) is high
Risk of reward hacking if the reward signal (final answer correctness) is sparse or noisy

Reproducibility

Code: https://github.com/inclusionAI/ASearcher

Models, training data, and code are open-sourced.

📊 Experiments & Results

Evaluation Setup

Complex Question Answering requiring internet search

Benchmarks:

GAIA (General AI Assistants benchmark; complex, multi-modal)
xBench-DeepSearch (Long-horizon search tasks)
Frames (Multi-hop reasoning)

Metrics:

Avg@4 (Average accuracy over 4 samples)
Pass@4 (Accuracy if at least one of 4 samples is correct)
Statistical methodology: Not explicitly reported in the paper
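The two metrics can be computed as in the sketch below, which follows the definitions given above. The function names and the toy correctness data are illustrative and do not come from the paper.

```python
def avg_at_k(results):
    """Mean per-question accuracy over k samples, as a percentage."""
    return 100.0 * sum(sum(r) / len(r) for r in results) / len(results)

def pass_at_k(results):
    """Percentage of questions with at least one correct sample among k."""
    return 100.0 * sum(any(r) for r in results) / len(results)

# Toy data: 3 questions, 4 sampled answers each (True = judged correct).
results = [
    [True, True, False, False],
    [False, False, False, False],
    [True, False, False, False],
]

avg4 = avg_at_k(results)    # (0.5 + 0.0 + 0.25) / 3 questions
pass4 = pass_at_k(results)  # 2 of 3 questions have at least one correct sample
```

Pass@4 is never lower than Avg@4 on the same samples, which is why the Pass@4 figures reported for a model sit above its Avg@4 figures.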

Key Results

Performance of the flagship LRM-based agent (QwQ-32B) compared to baselines on difficult benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
GAIA	Avg@4	43.7	58.7	+15.0
xBench-DeepSearch	Avg@4	28.7	51.1	+22.4
GAIA	Pass@4	71.9	74.7	+2.8
xBench-DeepSearch	Pass@4	63.0	75.0	+12.0
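The Δ column reports absolute point differences, while the headline percentages elsewhere in this summary are relative gains. The short check below, using only the numbers from the table, shows how the two relate; `relative_gain` is a helper introduced for this example.

```python
def relative_gain(baseline, new):
    """Relative improvement over the baseline, as a percentage."""
    return 100.0 * (new - baseline) / baseline

# (baseline, this paper) pairs copied from the table above.
table = {
    ("GAIA", "Avg@4"): (43.7, 58.7),
    ("xBench-DeepSearch", "Avg@4"): (28.7, 51.1),
    ("GAIA", "Pass@4"): (71.9, 74.7),
    ("xBench-DeepSearch", "Pass@4"): (63.0, 75.0),
}

absolute = {key: round(new - base, 1) for key, (base, new) in table.items()}
relative = {key: round(relative_gain(base, new), 1) for key, (base, new) in table.items()}
```

For the Avg@4 rows this gives relative gains of 34.3% on GAIA and 78.0% on xBench-DeepSearch, which matches the percentages quoted in the highlights.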

Experiment Figures

Impact of turn limits on accuracy and the distribution of tool calls/tokens during training

Main Takeaways

RL training significantly enhances search capabilities, improving base QwQ Avg@4 by 34.3% (GAIA) and 78.0% (xBench-DeepSearch) in relative terms
Allowing more search turns correlates with higher accuracy; artificially limiting turns (as in prior work) harms performance
ASearcher-Web-QwQ learns expert behaviors: uncertainty decomposition, precise information extraction, and cross-document verification
The fully asynchronous training pipeline effectively handles the high variance in trajectory length (some taking 10x more time than others)

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Large Language Models (LLMs)
Agentic workflows (Tool use)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages based on the relative rewards of a group of trajectories for the same input

LRM: Large Reasoning Model—LLMs specifically optimized for complex reasoning tasks (e.g., QwQ-32B)

Search Intelligence: The ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration to resolve conflicts

append-only prompting: A history management technique where all new actions and observations are simply added to the end of the context window

fuzzing: A data synthesis technique where specific details in a seed question are obscured or generalized to increase difficulty and uncertainty

injection: A data synthesis technique where external facts are inserted into a seed question to enrich context and complexity

Pass@4: A metric evaluating if the correct answer is present within 4 attempts/samples

Avg@4: The average score across 4 samples