
OAgents: An Empirical Study of Building Effective Agents

He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Y. Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, Wangchunshu Zhou
OPPO Personal AI
Conference on Empirical Methods in Natural Language Processing (2025)
Agent Memory Benchmark Reasoning MM

📝 Paper Summary

Agent Frameworks Agentic Planning and Memory Benchmarking
The paper conducts a systematic empirical study to identify critical design choices for language agents, proposing a robust evaluation protocol and the modular OAgents framework which optimizes planning, memory, and tool use.
Core Problem
Agent research currently lacks standardization and scientific rigor: core components (planning, memory, tools) and evaluation protocols differ from one work to the next, causing high variance and poor reproducibility.
Why it matters:
  • Lack of standardization makes it impossible to attribute performance improvements to specific innovations versus engineering tricks or random variance.
  • Inconsistent evaluation settings (e.g., number of runs, error handling) prevent fair comparisons across different frameworks on benchmarks like GAIA.
  • The fragmentation of design choices undermines scientific progress, as findings cannot be reliably compared or built upon.
Concrete Example: Prior work on the GAIA benchmark often merges results from multiple runs yet reports them as 'pass@1', or omits the specific tool implementations, so other researchers cannot reproduce the reported results.
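To make the reporting gap in the example concrete, the sketch below contrasts a per-run average with a score obtained by merging runs. The function names (`pass_at_1`, `merged_score`) and the outcome data are hypothetical and chosen for illustration only; they are not taken from the paper or from GAIA.

```python
from typing import List


def pass_at_1(runs: List[List[bool]]) -> float:
    """Score each run on its own, then average the per-run accuracies."""
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)


def merged_score(runs: List[List[bool]]) -> float:
    """Count a task as solved if ANY run solved it (a pass@k-style score)."""
    n_tasks = len(runs[0])
    solved = [any(run[i] for run in runs) for i in range(n_tasks)]
    return sum(solved) / n_tasks


# Made-up outcomes: 3 independent runs over the same 4 tasks.
runs = [
    [True, False, False, True],
    [False, True, False, True],
    [True, False, False, False],
]

print(f"per-run average: {pass_at_1(runs):.3f}")
print(f"merged over runs: {merged_score(runs):.3f}")
```

With this toy data the merged score is noticeably higher than the per-run average, which is why labeling a merged result as 'pass@1' overstates single-attempt performance.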
Key Novelty
Dual-Axis Analysis (FAC & LRF) & Modular Framework
  • Decomposes agent design into Factual Acquisition Capacity (FAC) for gathering external knowledge and Logical Reasoning Fidelity (LRF) for consistent decision-making.
  • Introduces OAgents, a modular framework integrating periodic plan revision, fine-grained task decomposition, optimized multi-source web browsing, and adaptive memory.
  • Proposes a standardized evaluation protocol (e.g., majority voting, specific inference parameters) to reduce experimental variance and ensure fair comparisons.
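As a rough illustration of the majority-voting idea mentioned in the last bullet, the following minimal sketch aggregates the final answers of several runs on one task. The summary does not specify the paper's exact voting or answer-normalization procedure, so the `majority_vote` helper, the lowercase/strip normalization, and the tie-breaking rule are all assumptions.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer after simple normalization.

    Assumption: answers are compared case-insensitively with surrounding
    whitespace removed. Ties go to the answer that appeared first.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    # most_common orders equal counts by first occurrence.
    return counts.most_common(1)[0][0]


# Hypothetical final answers from three runs on the same question.
run_answers = ["Paris", " paris", "Lyon"]
print(majority_vote(run_answers))
```

Reporting the voted answer, together with the number of runs used, keeps the aggregation explicit instead of folding it into a single-run metric.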
Evaluation Highlights
  • Ranks 1st among open-source agent frameworks on the GAIA benchmark.
  • Achieves state-of-the-art performance among open-source projects on BrowseComp.
  • Demonstrates that a standardized evaluation protocol yields more stable comparisons than earlier high-variance reporting.
Breakthrough Assessment
8/10
Strong contribution to scientific rigor in a chaotic field. The empirical study clarifies best practices, and the resulting framework achieves SOTA among open-source frameworks on key benchmarks.