
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Zhenting Wang, Qi Chang, Hemani Patel, S. Biju, Chen Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
Center for Advanced AI, Accenture, University of California, Berkeley
arXiv.org (2025)
Benchmark Agent Reasoning

📝 Paper Summary

Agentic tool use evaluation | Multi-call tool use with flexible plan
MCP-Bench evaluates LLM agents on complex, real-world tasks requiring cross-tool coordination and planning by connecting them to 28 live MCP servers with 250 structured tools.
Core Problem
Existing tool-use benchmarks rely on isolated APIs or artificial pipelines that fail to capture the complexity of real-world workflows, which involve cross-domain coordination, multi-goal objectives, and fuzzy instructions.
Why it matters:
  • Current benchmarks like ToolBench and BFCL focus on isolated functionality or short dependency chains, missing the realistic need for long-horizon planning.
  • Real-world agents must handle ambiguous user requests without explicit tool names, a capability not adequately tested by benchmarks that provide specific execution steps.
  • Prior MCP-based benchmarks (MCP-RADAR, MCPEval) are too narrow, covering few servers and lacking complex multi-goal objectives.
Concrete Example: A user asks for a 'week-long hiking loop in Denver with weather alerts and hotel options.' Current benchmarks would expect explicit steps. In MCP-Bench, the agent must infer the need to coordinate Google Maps, Weather Data, and National Parks tools, passing outputs (locations) into inputs (forecasts) without explicit instruction.
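The output-to-input dependency in this example can be illustrated with a minimal sketch. The two functions below are hypothetical stand-ins for tools on a maps server and a weather server; their names, arguments, and return values are assumptions made for illustration and do not come from MCP-Bench or any real MCP server.

```python
from typing import Dict, List


def find_trailheads(city: str) -> List[Dict[str, object]]:
    """Hypothetical maps tool: returns candidate locations near a city."""
    # Hard-coded illustrative data; a real tool would query a live server.
    return [
        {"name": "Trail A", "lat": 39.7, "lon": -105.2},
        {"name": "Trail B", "lat": 39.9, "lon": -105.3},
    ]


def get_forecast(lat: float, lon: float) -> Dict[str, bool]:
    """Hypothetical weather tool: raises an alert above an arbitrary latitude."""
    return {"alert": lat > 39.8}


def plan_hike(city: str) -> List[Dict[str, object]]:
    """Chain the tools: locations from the first call feed the second call."""
    plan = []
    for loc in find_trailheads(city):
        # The dependency: the forecast call needs coordinates that only
        # exist after the location lookup has returned.
        forecast = get_forecast(loc["lat"], loc["lon"])
        plan.append({"name": loc["name"], "alert": forecast["alert"]})
    return plan
```

In the benchmark setting described above, the agent has to discover this ordering itself, since the fuzzy instruction names neither tool.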
Key Novelty
Ecosystem-based Benchmarking via Model Context Protocol (MCP)
  • Leverages the standardized MCP interface to connect agents to 28 live, production-grade servers (e.g., finance, science) rather than static API mocks.
  • Synthesizes tasks with 'fuzzy' instructions that strip away tool names, forcing agents to perform retrieval and planning rather than just translating commands.
  • Introduces a multi-faceted evaluation combining rule-based execution checks with a rubric-driven LLM judge that assesses planning efficiency and dependency awareness.
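The rule-based half of such an evaluation could look roughly like the sketch below. It is not the paper's implementation: the trace format, the simplified tool registry (tool name mapped to its required parameter names), and the metric names are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set


def rule_based_checks(
    calls: List[Dict[str, object]],
    tools: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Score a tool-call trace against a registry of known tools.

    calls: hypothetical trace, each item {"tool": str, "args": dict}.
    tools: hypothetical registry mapping tool name -> required parameter names.
    """
    total = len(calls)
    if total == 0:
        return {"valid_name_rate": 0.0, "schema_ok_rate": 0.0}

    valid_names = 0
    schema_ok = 0
    for call in calls:
        required = tools.get(call["tool"])
        if required is None:
            continue  # the agent called a tool that does not exist
        valid_names += 1
        # Count the call as schema-compliant if every required parameter is present.
        if required <= set(call["args"]):
            schema_ok += 1

    return {
        "valid_name_rate": valid_names / total,
        "schema_ok_rate": schema_ok / total,
    }
```

Checks of this kind are deterministic and cheap, which is why they pair naturally with a rubric-driven LLM judge for qualities such as planning efficiency that rules cannot capture.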
Evaluation Highlights
  • GPT-5 achieves the highest overall score of 0.749, demonstrating superior planning effectiveness (0.749) compared to Llama-3.1-8B-Instruct (0.141).
  • Strong models like o3 and GPT-5 maintain stable performance across single-server and multi-server settings, whereas smaller models like Llama-3.1-8B drop from 0.438 (single) to 0.415 (multi).
  • While schema understanding has converged (most models >95% valid tool naming), planning remains the key differentiator, with GPT-5 scoring 0.761 in dependency awareness vs. 0.337 for Llama-3.1-8B.
Breakthrough Assessment
8/10
Significantly advances tool-use benchmarking by moving from isolated APIs to connected ecosystems via MCP. The focus on fuzzy instructions and cross-server dependencies addresses a critical gap in agent evaluation.