Scaling Agents via Continual Pre-training

📝 Paper Summary

Self-evolving, Agentic reasoning, Multi-call tool use with flexible plan

AgentFounder introduces Agentic Continual Pre-training as an intermediate scaling layer to inject tool-use and reasoning capabilities into foundation models before post-training, resolving optimization conflicts in deep research agents.

Core Problem

General-purpose foundation models lack inherent agentic inductive biases, forcing post-training (SFT/RL) to simultaneously learn capabilities and alignment, which creates optimization tensions and leads to underperformance in complex agentic tasks.

Why it matters:

Existing open-source agents (e.g., WebSailor, DeepSeek-V3.1) significantly lag behind OpenAI's Deep Research (e.g., 30.0 vs 51.5 on BrowseComp) because they rely on general-purpose base models.
SFT relies on high-quality complete trajectories which are scarce and hard to define for long-horizon tasks.
Current post-training locks models into imitating specific patterns rather than developing flexible decision-making capabilities needed for unpredictable environments.

Concrete Example: When training a deep research agent, standard SFT might force a model to memorize a specific search query sequence for a complex question. However, if a search tool fails or returns unexpected results during inference, the model lacks the fundamental decision-making capability to adapt its strategy because it only learned to mimic a fixed path, not the underlying reasoning process.

Key Novelty

Agentic Continual Pre-training (Agentic CPT)

Inserts a massive pre-training stage (300B+ tokens) between general pre-training and post-training, focused exclusively on agentic data (reasoning, tool use, planning).
Generates synthetic training data without expensive API calls by restructuring static knowledge into 'Question-Planning-Action' tuples (First-order Action Synthesis).
Transforms existing trajectories into multi-step decision-making data by expanding options at each step and explicitly modeling the choice process (Higher-order Action Synthesis).
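
As a rough illustration of First-order Action Synthesis, the sketch below turns one static knowledge record into a (question, planning, action) tuple without executing any tool. The `KnowledgeEntity` and `FASExample` types, the template wording, and the `search` action format are hypothetical; in the paper an LLM writes the question and plan, and simple string templates stand in for it here.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeEntity:
    """A static fact record (hypothetical schema, not from the paper)."""
    name: str
    attribute: str
    value: str


@dataclass
class FASExample:
    """One (question, planning, action) training tuple."""
    question: str
    planning: str
    action: dict


def synthesize_fas(entity: KnowledgeEntity) -> FASExample:
    # An LLM would write these fields; templates are a stand-in for it.
    question = f"What is the {entity.attribute} of {entity.name}?"
    planning = (
        f"I need the {entity.attribute} of {entity.name}. "
        "I will search for it first, then verify the result on a source page."
    )
    # The action is only predicted, never executed, so no API call is made.
    action = {"tool": "search", "query": f"{entity.name} {entity.attribute}"}
    return FASExample(question, planning, action)


example = synthesize_fas(KnowledgeEntity("the Eiffel Tower", "completion year", "1889"))
print(example.question)
print(example.action)
```

Because the first action is written down rather than run, the cost of producing each example is one text generation, which is what lets this kind of data scale without online tool calls.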

Architecture

The complete Agentic Training Pipeline, showing the progression from base model through two stages of Agentic CPT to Post-training.

Evaluation Highlights

+9.9 points on BrowseComp-en over DeepSeek-V3.1 (39.9 vs 30.0), narrowing the gap to the closed-source OpenAI Deep Research (51.5).
Achieves 31.5% Pass@1 on the expert-level HLE benchmark, surpassing all reported closed-source models including OpenAI Deep Research (26.6) and Kimi-Researcher (26.9).
Demonstrates strong scaling laws for agentic capabilities, with performance steadily increasing across data volume (up to 315B tokens) and model size (1B to 30B).

Breakthrough Assessment

9/10

Proposes a fundamental shift in the training pipeline for agents (Continual Pre-training instead of just SFT/RL). Achieves SOTA results on difficult benchmarks like HLE and BrowseComp, significantly closing the gap with proprietary models.

⚙️ Technical Details

Problem Definition

Setting: Development of a Deep Research Agent capable of autonomous multi-step reasoning, tool use, and long-horizon planning.

Inputs: Complex user queries requiring information retrieval and synthesis.

Outputs: Comprehensive answers or research reports generated through autonomous tool interaction.

Pipeline Flow

Input Query
Agentic Foundation Model (AgentFounder)
Tool Execution (Search, Browse, Code, etc.)
Iterative Reasoning & Action Loop
Final Answer Generation
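
The flow above can be sketched as a loop that alternates between a model decision and a tool call until an answer is produced or the tool budget is spent. This is a minimal sketch, not the paper's implementation: `run_agent`, `stub_policy`, and `stub_tools` are hypothetical names, and the policy is a hand-written stub standing in for the model. The default budget of 128 calls mirrors the `max_tool_usage` setting listed under Key Hyperparameters.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Step = Tuple[str, str, str]  # (tool name, tool input, observation)


def run_agent(
    query: str,
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    tools: Dict[str, Tool],
    max_tool_calls: int = 128,
) -> str:
    """Iterate decide -> act -> observe until the policy answers or the budget runs out."""
    history: List[Step] = []
    for _ in range(max_tool_calls):
        name, arg = policy(query, history)
        if name == "answer":
            return arg
        tool = tools.get(name)
        # A failed or unknown tool still yields an observation the policy can react to.
        observation = tool(arg) if tool else f"error: unknown tool {name!r}"
        history.append((name, arg, observation))
    return "no answer within tool budget"


def stub_policy(query: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: search once, then answer with the observation."""
    if not history:
        return "search", query
    return "answer", history[-1][2]


stub_tools: Dict[str, Tool] = {"search": lambda q: f"top result for: {q}"}
result = run_agent("capital of France", stub_policy, stub_tools)
print(result)
```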

System Modules

AgentFounder

Central brain performing planning, reasoning, and tool invocation decisions

Model or implementation: AgentFounder-30B (based on Qwen3-30B-A3B-Base)

Tool Set

Execute actions generated by the agent

Model or implementation: External APIs

Novel Architectural Elements

Two-stage Agentic CPT pipeline: Stage 1 (32K context) for general behaviors, Stage 2 (128K context) for long-horizon planning
Data synthesis pipeline (FAS/HAS) designed to create agentic training data without online tool costs

Modeling

Base Model: Qwen3-30B-A3B-Base

Training Method: Agentic Continual Pre-training followed by SFT

Objective Functions:

Purpose: Standard next-token prediction for language modeling.

Formally: $\mathcal{L} = -\sum_{t} \log P(x_{t+1} \mid x_1, \dots, x_t)$
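
To make the next-token objective concrete, the snippet below evaluates it for a three-token continuation. The probabilities are made up for illustration; `step_probs[t]` is assumed to hold the probability the model assigned to the token that actually came next at position t.

```python
import math


def next_token_loss(step_probs):
    """Sum of -log P(x_{t+1} | x_1..x_t) over all positions."""
    return -sum(math.log(p) for p in step_probs)


probs = [0.5, 0.25, 0.125]  # illustrative values, not from the paper
loss = next_token_loss(probs)
print(round(loss, 4))  # equals ln(2 * 4 * 8) = ln(64)
```

In practice the sum is usually averaged over tokens and computed from logits in a numerically stable way, but the quantity being minimized is the same.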

Adaptation: Full model update during CPT and SFT

Training Data:

Agentic CPT Stage 1: ~200B tokens (agent data + knowledge reasoning) with 32K context
Agentic CPT Stage 2: 100B tokens (high-quality agent data) with 128K context
FAS Data: Synthesized from knowledge entities into (Question, Plan, Action) tuples via LLM
HAS Data: Synthesized from trajectories by generating N alternative thoughts/actions per step and modeling the decision process
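
The Higher-order Action Synthesis idea can be sketched as follows: take one step from an existing trajectory, add N alternative candidates, and write out the choice explicitly. Everything about the text layout, the `propose` callback (which stands in for an LLM that generates alternatives), and the deterministic rotation used to vary the position of the original step is an assumption made for this example, not the paper's exact format.

```python
from typing import Callable, List


def build_has_sample(
    question: str,
    context: str,
    original_step: str,
    propose: Callable[[str, str, int], List[str]],
    n_alternatives: int = 3,
) -> dict:
    """Turn one trajectory step into a multi-option decision example."""
    alternatives = propose(question, context, n_alternatives)
    options = [original_step] + alternatives
    # Rotate so the original step is not always listed first (illustrative choice).
    shift = len(context) % len(options)
    options = options[shift:] + options[:shift]
    chosen = options.index(original_step) + 1  # 1-based option number
    lines = [
        f"Question: {question}",
        f"Context so far: {context}",
        "Candidate next steps:",
    ]
    lines += [f"  Option {i}: {opt}" for i, opt in enumerate(options, start=1)]
    lines.append(f"Decision: I will choose option {chosen}.")
    return {"text": "\n".join(lines), "chosen": chosen, "options": options}


def stub_propose(question: str, context: str, n: int) -> List[str]:
    """Stand-in for an LLM that proposes n alternative next steps."""
    return [f"alternative step {i}" for i in range(1, n + 1)]


sample = build_has_sample(
    "Who founded the company?", "ctx", "search: company founder", stub_propose
)
print(sample["text"])
```

The point of the format is that the training text contains the rejected options as well as the selected one, so the model sees a decision rather than a single path to copy.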

Key Hyperparameters:

inference_temperature: 0.85
inference_repetition_penalty: 1.1
inference_top_p: 0.95
max_tool_usage: 128 calls
context_length: 128K tokens
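
The three sampling settings interact as sketched below for a toy vocabulary. This is a minimal sketch under stated assumptions: the order of operations (repetition penalty, then temperature, then top-p) and the divide-or-multiply form of the penalty follow a common convention, and nothing here is confirmed to match the inference stack used in the paper. The function name and the toy logits are hypothetical.

```python
import math
from typing import Dict, List


def filtered_distribution(
    logits: Dict[str, float],
    generated: List[str],
    temperature: float = 0.85,
    top_p: float = 0.95,
    repetition_penalty: float = 1.1,
) -> Dict[str, float]:
    """Apply repetition penalty, temperature, then nucleus (top-p) filtering."""
    seen = set(generated)
    adjusted = {}
    for tok, logit in logits.items():
        if tok in seen:
            # Penalize repeats: shrink positive logits, push negative ones lower.
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        adjusted[tok] = logit / temperature
    # Softmax with max-subtraction for numerical stability.
    peak = max(adjusted.values())
    exps = {tok: math.exp(v - peak) for tok, v in adjusted.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


dist = filtered_distribution(
    {"search": 2.0, "visit": 1.0, "answer": -3.0}, generated=["search"]
)
print(dist)
```

With these toy logits the low-probability `answer` token falls outside the 0.95 nucleus and is dropped, while the penalty slightly reduces the lead of the already-used `search` token.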

Compute: Not reported in the paper

Comparison to Prior Work

vs. WebSailor: Introduces Continual Pre-training phase rather than just SFT/RL on trajectories
vs. Standard SFT: Uses synthesized 'decision-making' data (HAS) to learn exploration rather than just trajectory cloning
vs. DeepSeek-V3.1: Focuses on scaling agentic data during pre-training rather than general reasoning RL (the paper uses DeepSeek-V3.1 as a baseline, not as a direct methodology comparison)

Limitations

Performance on Chinese benchmarks (BrowseComp-zh) lags behind OpenAI-o3, potentially due to data balance or search tool bias.
Reliance on Google Search tool may affect reproducibility or performance in different regions.
The ~300B-token continual pre-training stage carries a high computational cost.

Reproducibility

Code: https://github.com/Alibaba-NLP/DeepResearch

The paper describes the data synthesis methods (FAS/HAS) in detail, and the base models (Qwen3) are public. Training compute and time are not specified.

📊 Experiments & Results

Evaluation Setup

Single-agent ReAct paradigm equipped with 5 core tools (Search, Visit, Scholar, Python, File Parser).

Benchmarks:

BrowseComp-en (General web search/browsing)
BrowseComp-zh (General web search/browsing, Chinese)
GAIA (General AI Assistant tasks, text-only subset)
HLE, Humanity’s Last Exam (Expert-level multi-subject questions)
DeepResearch Bench (Research report generation)
Frames (Multi-perspective reasoning)

Metrics:

Pass@1
RACE Overall (for DeepResearch Bench)
Statistical methodology: Not explicitly reported in the paper
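
For readers unfamiliar with the metric, Pass@k is sketched below as the percentage of questions answered correctly in at least one of the first k attempts (a Pass@3 figure appears in the ablation rows of Key Results). The attempt outcomes are made up, and how the paper judges an individual attempt as correct is not covered here.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Percentage of questions with at least one correct result in the first k attempts."""
    solved = sum(any(results[:k]) for results in attempts.values())
    return 100.0 * solved / len(attempts)


# Illustrative outcomes for four questions, three attempts each (not real data).
runs = {
    "q1": [True, False, False],
    "q2": [False, False, True],
    "q3": [False, False, False],
    "q4": [False, True, True],
}
print(pass_at_k(runs, 1))  # only q1 is solved on the first attempt
print(pass_at_k(runs, 3))  # q1, q2 and q4 are solved within three attempts
```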

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
AgentFounder outperforms open-source baselines and rivals commercial models on general web search benchmarks.
BrowseComp-en	Pass@1	30.0	39.9	+9.9
BrowseComp-zh	Pass@1	37.5	43.3	+5.8
GAIA	Pass@1	70.5	72.8	+2.3
On difficult, expert-level benchmarks, AgentFounder demonstrates superior reasoning capabilities.
HLE	Pass@1	26.6	31.5	+4.9
DeepResearch Bench	RACE Overall	46.5	47.9	+1.4
Ablation studies confirm the value of Agentic CPT and the specific data synthesis methods.
BrowseComp-en	Pass@1	28.6	39.9	+11.3
BrowseComp-zh	Pass@3	54.3	54.7	+0.4
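
The Δ column is the difference in score points between the two systems, not a relative percentage. A quick check that recomputes it from the Baseline and This Paper columns of the table above:

```python
# (benchmark, metric, baseline, this paper), copied from the Key Results table
rows = [
    ("BrowseComp-en", "Pass@1", 30.0, 39.9),
    ("BrowseComp-zh", "Pass@1", 37.5, 43.3),
    ("GAIA", "Pass@1", 70.5, 72.8),
    ("HLE", "Pass@1", 26.6, 31.5),
    ("DeepResearch Bench", "RACE Overall", 46.5, 47.9),
    ("BrowseComp-en", "Pass@1", 28.6, 39.9),  # ablation row
    ("BrowseComp-zh", "Pass@3", 54.3, 54.7),  # ablation row
]

# Round to one decimal to hide floating-point noise in the subtraction.
deltas = [round(ours - base, 1) for _, _, base, ours in rows]
for (name, metric, _, _), delta in zip(rows, deltas):
    print(f"{name} ({metric}): {delta:+.1f}")
```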

Main Takeaways

Agentic CPT acts as a universal enhancer: Models fine-tuned on AgentFounder-Base consistently outperform those on Qwen3-Base across different SFT data mixtures.
Scaling laws apply to agentic capabilities: Performance scales logarithmically with training token count (up to 315B) and positively with model size.
Information retrieval tasks benefit more from Agentic CPT than knowledge-intensive tasks, though both show improvement.
Two-stage training (incorporating long-context data in stage 2) is crucial for performance gains.
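
One way to read "scales logarithmically" is the functional form below, where $S$ is a benchmark score, $N$ is the number of Agentic CPT tokens, and $a$, $b$ are fitted constants. The symbols and the exact form are assumptions for illustration; no fitted values are given in this summary.

```latex
S(N) = a + b \ln N
\quad\Longrightarrow\quad
S(2N) - S(N) = b \ln 2
```

Under this form every doubling of the token count adds the same fixed increment $b \ln 2$ to the score, so absolute gains per additional token shrink as training grows even though performance keeps rising.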

📚 Prerequisite Knowledge

Prerequisites

Large Language Model pre-training and post-training paradigms (SFT, RL)
Agentic workflows (ReAct, tool use)
Synthetic data generation techniques

Key Terms

Agentic CPT: Agentic Continual Pre-training—an intermediate training stage using massive agent-specific data to align foundation models before fine-tuning.

FAS: First-order Action Synthesis—generating (question, planning, action) data from static knowledge without external tool execution.

HAS: Higher-order Action Synthesis—expanding existing trajectories into multi-choice decision-making data to teach exploration and decision selection.

SFT: Supervised Fine-Tuning—training models on labeled examples to follow instructions.

RL: Reinforcement Learning—optimizing models based on reward signals.

Deep Research Agent: An autonomous agent capable of orchestrating sophisticated workflows (search, browsing, code) to solve complex tasks.

ReAct: Reasoning + Acting—a paradigm where models generate reasoning traces before taking actions.