Evaluation Setup
The agent follows a single-agent ReAct paradigm and is equipped with five core tools (Search, Visit, Scholar, Python, File Parser).
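To make the setup concrete, here is a minimal sketch of a single-agent ReAct loop that alternates thought, tool call, and observation. Everything in it is an assumption for illustration: the tool functions are stubs named after the five tools listed above, and `react_loop`, `scripted_policy`, and the step budget are hypothetical names, not the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the five tools; real tools would call external services.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"[search results for {query!r}]",
    "visit": lambda url: f"[page content of {url}]",
    "scholar": lambda query: f"[papers matching {query!r}]",
    "python": lambda code: f"[output of running {code!r}]",
    "file_parser": lambda path: f"[parsed text of {path}]",
}

# A policy maps the running history to (thought, action, action_input).
Step = Tuple[str, str, str]
Policy = Callable[[List[str]], Step]


def react_loop(question: str, policy: Policy, max_steps: int = 8) -> str:
    """Run thought -> action -> observation steps until the policy answers."""
    history: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        if action == "answer":
            return arg
        # Unknown tool names are reported back as an observation, not raised.
        tool = TOOLS.get(action)
        observation = tool(arg) if tool else f"unknown tool {action!r}"
        history.append(f"Action: {action}[{arg}]")
        history.append(f"Observation: {observation}")
    return "no answer within step budget"


def scripted_policy(history: List[str]) -> Step:
    """Toy policy (assumption): search once, then answer."""
    if not any(line.startswith("Observation:") for line in history):
        return ("I should look this up.", "search", "example query")
    return ("I have enough information.", "answer", "example answer")


if __name__ == "__main__":
    print(react_loop("example question", scripted_policy))
```

In an actual system the policy would be the language model itself, and the history would be serialized into its context window at every step.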
Benchmarks:
- BrowseComp-en (General web search/browsing)
- BrowseComp-zh (General web search/browsing, Chinese)
- GAIA (General AI Assistant tasks, text-only subset)
- HLE, Humanity’s Last Exam (Expert-level multi-subject questions)
- DeepResearch Bench (Research report generation)
- Frames (Multi-perspective reasoning)
Metrics:
- Pass@1
- RACE Overall (for DeepResearch Bench)
- Statistical methodology: Not explicitly reported in the paper
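The summary does not say how Pass@k is computed, so the following is only a sketch under a common reading: a question counts as solved at k if at least one of its first k independent attempts is correct. The function name `pass_at_k` and the toy results are assumptions for illustration, not data from the paper.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Percentage of questions with a correct answer among the first k attempts.

    attempts[i][j] is True when attempt j on question i was judged correct.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("at least one question is required")
    solved = 0
    for per_question in attempts:
        if len(per_question) < k:
            raise ValueError("every question needs at least k attempts")
        if any(per_question[:k]):
            solved += 1
    return 100.0 * solved / len(attempts)


if __name__ == "__main__":
    # Toy, made-up outcomes for four questions with three attempts each.
    toy = [
        [True, False, False],
        [False, False, True],
        [False, False, False],
        [False, True, True],
    ]
    print(pass_at_k(toy, 1), pass_at_k(toy, 3))
```

Under this reading Pass@3 can never be lower than Pass@1 on the same runs, which is why the two metrics should not be compared directly across rows.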
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AgentFounder outperforms open-source baselines and rivals commercial models on general web search benchmarks. |
| BrowseComp-en | Pass@1 | 30.0 | 39.9 | +9.9 |
| BrowseComp-zh | Pass@1 | 37.5 | 43.3 | +5.8 |
| GAIA | Pass@1 | 70.5 | 72.8 | +2.3 |
| On difficult, expert-level benchmarks, AgentFounder demonstrates superior reasoning capabilities. |
| HLE | Pass@1 | 26.6 | 31.5 | +4.9 |
| DeepResearch Bench | RACE Overall | 46.5 | 47.9 | +1.4 |
| Ablation studies confirm the value of Agentic CPT and the specific data synthesis methods. |
| BrowseComp-en | Pass@1 | 28.6 | 39.9 | +11.3 |
| BrowseComp-zh | Pass@3 | 54.3 | 54.7 | +0.4 |
Main Takeaways
- Agentic CPT acts as a universal enhancer: Models fine-tuned from AgentFounder-Base consistently outperform those fine-tuned from Qwen3-Base across different SFT data mixtures.
- Scaling laws apply to agentic capabilities: Performance scales logarithmically with training token count (up to 315B) and positively with model size.
- Information retrieval tasks benefit more from Agentic CPT than knowledge-intensive tasks, though both show improvement.
- Two-stage training (incorporating long-context data in stage 2) is crucial for performance gains.
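The logarithmic scaling noted in the takeaways can be read as score ≈ a + b · ln(tokens). The sketch below fits that form with ordinary least squares. The data points are synthetic, generated from known coefficients purely to show that the fit recovers them; they are not results from the paper, and `fit_log_linear` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(tokens: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * ln(tokens); returns (a, b)."""
    if len(tokens) != len(scores) or len(tokens) < 2:
        raise ValueError("need at least two (tokens, score) pairs of equal length")
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("token counts must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


if __name__ == "__main__":
    # SYNTHETIC data (assumption): generated from a = 2.0, b = 1.5, no noise.
    token_counts = [1e9, 1e10, 1e11]
    synthetic_scores = [2.0 + 1.5 * math.log(t) for t in token_counts]
    a, b = fit_log_linear(token_counts, synthetic_scores)
    print(f"a = {a:.3f}, b = {b:.3f}")
```

With a fit of this form, each tenfold increase in training tokens adds a constant b · ln(10) to the predicted score, which is what logarithmic scaling means in practice.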