Evaluation Setup
Agentic tasks involving three tools: web search, a web browser, and a code executor.
Benchmarks:
- GAIA (General AI Assistants, Levels 1-3)
- Humanity's Last Exam (HLE; hard reasoning and knowledge)
- WebWalkerQA (Web navigation and QA)
- HotpotQA (Multi-hop QA)
Metrics:
- Pass@1
- Pass@5
- Statistical methodology: Not explicitly reported in the paper
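Since this summary does not say how Pass@1 and Pass@5 were computed, the following is only a minimal sketch of one common convention: the unbiased pass@k estimator, which draws n samples per task, counts c correct ones, and estimates the chance that at least one of k samples is correct. The function names `pass_at_k` and `benchmark_pass_at_k` and the percent scaling are illustrative assumptions, not the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: total samples drawn for the task
    c: number of those samples that are correct
    k: budget of samples considered (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are wrong)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over tasks, in percent. `results` holds (n, c) per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With n = k the estimator reduces to "at least one of the k samples is correct", and with k = 1 it reduces to the fraction of correct samples, which matches the usual reading of Pass@1.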
Key Results
Performance comparison on General Web Agent Benchmarks (Pass@1) shows AEPO consistently outperforming baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WebWalkerQA | Pass@1 | 39.0 | 43.0 | +4.0 |
| Humanity's Last Exam | Pass@1 | 10.4 | 11.2 | +0.8 |

Pass@5 results demonstrate AEPO's ability to generate diverse and correct solutions.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Humanity's Last Exam | Pass@5 | 22.2 | 26.0 | +3.8 |
Main Takeaways
- AEPO consistently outperforms 7 mainstream RL algorithms across 14 datasets.
- The method is particularly effective on complex, long-horizon tasks like GAIA and WebWalkerQA where diverse tool use is critical.
- Analysis shows that AEPO maintains higher and more stable policy entropy than the baselines throughout training, which prevents entropy collapse.
- Ablation studies confirm both the dynamic rollout and the entropy-balanced policy update are necessary for optimal performance.
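To make the policy-entropy takeaway concrete, here is a minimal sketch of how per-token policy entropy is commonly measured from a model's output logits. It assumes Shannon entropy in nats averaged over token positions; the helper names are hypothetical, and the paper's exact entropy definition and logging procedure are not given in this summary.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_logits: list[list[float]]) -> float:
    """Average entropy over all token positions of a rollout."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)
```

A uniform distribution over V tokens gives the maximum value log(V), while a near-deterministic policy gives a value close to 0. Tracking the mean over training steps is one way to see whether entropy stays stable or collapses.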