
Survey on Evaluation of LLM-based Agents

Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
The Hebrew University of Jerusalem, IBM Research, Yale University
arXiv.org (2025)
Agent Benchmark Reasoning Memory

📝 Paper Summary

Agent Evaluation Benchmarks
This paper surveys evaluation methods for LLM-based agents. It organizes benchmarks by fundamental capabilities, application-specific domains, and generalist tasks, reviews frameworks for evaluating agents, and identifies gaps in cost, safety, and robustness assessment.
Core Problem
Standard LLM benchmarks (like MMLU) are insufficient for evaluating agents because agents operate sequentially in dynamic environments, maintain state, and use tools, introducing complexity beyond static text-to-text inference.
Why it matters:
  • Agents are increasingly applied to complex real-world tasks (software engineering, web navigation) where simple accuracy metrics fail to capture risks or efficiency
  • Existing evaluation methods are fragmented, making it difficult for developers to choose appropriate benchmarks for specific agentic capabilities like planning or memory
  • Current benchmarks often lag behind agent capabilities, lacking the realism and dynamic feedback loops necessary to test autonomous systems effectively
Concrete Example: In tool-use evaluation, early benchmarks like ToolBench assessed only simple one-step interactions with explicit parameters. They did not capture real-world complexities such as multi-step conversations in which parameters are implicit, or scenarios that require managing state across a long trajectory. Newer benchmarks like ToolSandbox address these cases.
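To make the contrast concrete, here is a minimal sketch of a stateful tool-use check. Everything in it is hypothetical and written for illustration: the `World` state, the `set_cellular` and `send_message` tools, and the `task_succeeded` scorer are not taken from ToolBench, ToolSandbox, or the survey. The point is only that a tool call can depend on state set by an earlier step, so scoring the final state of a trajectory catches failures that checking a single call would miss.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Hypothetical mutable state shared by all tools within one episode."""
    cellular_on: bool = False
    sent: list = field(default_factory=list)


def set_cellular(world: World, on: bool) -> str:
    world.cellular_on = on
    return f"cellular {'on' if on else 'off'}"


def send_message(world: World, to: str, text: str) -> str:
    # Implicit state dependency: this call only works if an earlier step
    # in the same trajectory enabled cellular service.
    if not world.cellular_on:
        return "error: cellular service is off"
    world.sent.append((to, text))
    return "sent"


TOOLS = {"set_cellular": set_cellular, "send_message": send_message}


def run_trajectory(calls):
    """Execute (tool_name, kwargs) pairs in order on a fresh world."""
    world = World()
    observations = [TOOLS[name](world, **kwargs) for name, kwargs in calls]
    return world, observations


def task_succeeded(world: World) -> bool:
    """Score the final world state, not the text of any single call."""
    return ("Ana", "running late") in world.sent


# A one-step trajectory and a multi-step trajectory for the same task.
one_step = [("send_message", {"to": "Ana", "text": "running late"})]
multi_step = [("set_cellular", {"on": True})] + one_step
```

Running `run_trajectory(one_step)` leaves the message unsent because the precondition was never met, while `run_trajectory(multi_step)` satisfies `task_succeeded`. A one-step benchmark with explicit parameters has no way to express that difference.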
Key Novelty
Comprehensive Taxonomy of Agent Evaluation
  • Systematically categorizes evaluation into four dimensions: fundamental capabilities (planning, tool use, self-reflection, memory), application-specific domains (web, code, science), generalist agents, and frameworks for evaluating agents
  • Maps the evolution from static datasets to dynamic, gym-like environments where agents receive environmental feedback rather than just comparing text outputs
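The difference between static and gym-like evaluation can be sketched in a few lines. This is an illustrative toy, not code from the survey or from any named benchmark: `CounterEnv`, `greedy_agent`, and `run_episode` are assumed names, and the `reset`/`step` interface only loosely follows the familiar gym convention. In the static case a single output is compared to a reference; in the interactive case the agent acts, observes the environment's feedback, and is scored on whether the episode ends in success.

```python
class CounterEnv:
    """Toy environment (hypothetical): reach a target count with 'inc'/'dec'."""

    def __init__(self, target: int, max_steps: int = 10):
        self.target = target
        self.max_steps = max_steps

    def reset(self) -> dict:
        self.value = 0
        self.steps = 0
        return {"value": self.value, "target": self.target}

    def step(self, action: str):
        # Unknown actions leave the state unchanged but still consume a step.
        self.value += {"inc": 1, "dec": -1}.get(action, 0)
        self.steps += 1
        success = self.value == self.target
        done = success or self.steps >= self.max_steps
        obs = {"value": self.value, "target": self.target}
        return obs, done, success


def greedy_agent(obs: dict) -> str:
    """Stand-in for an LLM policy: chooses an action from the latest observation."""
    return "inc" if obs["value"] < obs["target"] else "dec"


def run_episode(env, agent) -> bool:
    """Interactive evaluation: act, observe feedback, repeat until done."""
    obs = env.reset()
    done, success = False, False
    while not done:
        obs, done, success = env.step(agent(obs))
    return success


def static_exact_match(prediction: str, reference: str) -> bool:
    """Static evaluation, for contrast: one output against one reference."""
    return prediction.strip() == reference.strip()
```

For example, `run_episode(CounterEnv(target=3), greedy_agent)` succeeds within the step budget, while a target of 20 with the default budget of 10 steps does not. The step limit is what lets an environment-based evaluation penalize agents that never converge.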
Evaluation Highlights
  • Identifies over 50 specific benchmarks across domains, including specialized evaluations for planning (e.g., PlanBench), tool use (e.g., BFCL), and memory (e.g., StreamBench)
  • Highlights the shift from static text benchmarks to dynamic environments like OSWorld and WebArena that evaluate end-to-end task completion rates rather than multiple-choice accuracy
  • Reveals critical gaps in current evaluation: lack of standardized metrics for cost-efficiency, safety compliance, and robustness against errors in long-horizon tasks
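As an illustration of what reporting cost alongside task completion could look like, the sketch below aggregates per-episode results into a completion rate and a cost per successful task. The survey points to the absence of standardized metrics here, so none of this is a metric it defines: the `EpisodeResult` fields (including `cost_usd`) and the `summarize` function are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    success: bool    # did the agent complete the task end to end?
    cost_usd: float  # assumed: total model/API spend for the episode
    steps: int       # number of environment interactions used


def summarize(results):
    """Aggregate episode results into completion and cost figures."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    successes = sum(r.success for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "success_rate": successes / n,
        "mean_cost_usd": total_cost / n,
        "mean_steps": sum(r.steps for r in results) / n,
        # Failed episodes still cost money, so divide total spend by successes.
        "cost_per_success_usd": (
            total_cost / successes if successes else float("inf")
        ),
    }
```

Dividing total spend by the number of successes, rather than averaging cost over successful episodes only, means an agent that burns budget on failed attempts looks more expensive than one with the same completion rate and cheaper failures.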
Breakthrough Assessment
7/10
A highly useful structured survey that organizes a chaotic field. While it doesn't propose a new method, its taxonomy and identification of trends (like the shift to live benchmarks) provide a strong foundation for future research.