MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

📝 Paper Summary

Agentic tool use evaluation
Multi-call tool use with flexible plan

MCP-Bench evaluates LLM agents on complex, real-world tasks requiring cross-tool coordination and planning by connecting them to 28 live MCP servers with 250 structured tools.

Core Problem

Existing tool-use benchmarks rely on isolated APIs or artificial pipelines, so they fail to capture the complexity of real-world workflows that involve cross-domain coordination, multi-goal objectives, and fuzzy instructions.

Why it matters:

Current benchmarks like ToolBench and BFCL focus on isolated functionality or short dependency chains, missing the realistic need for long-horizon planning.
Real-world agents must handle ambiguous user requests without explicit tool names, a capability not adequately tested by benchmarks that provide specific execution steps.
Prior MCP-based benchmarks (MCP-RADAR, MCPEval) are too narrow, covering few servers and lacking complex multi-goal objectives.

Concrete Example: A user asks for a 'week-long hiking loop in Denver with weather alerts and hotel options.' Current benchmarks would expect explicit steps. In MCP-Bench, the agent must infer the need to coordinate Google Maps, Weather Data, and National Parks tools, passing outputs (locations) into inputs (forecasts) without explicit instruction.
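The output-to-input passing in this example can be sketched as below. This is a minimal illustration, not the benchmark's code: `search_trails`, `get_forecast`, their return shapes, and the coordinates are hypothetical stand-ins for live MCP tools.

```python
# Hypothetical stand-ins for live MCP tools; names and return shapes are invented.
def search_trails(city: str) -> list[dict]:
    """Pretend map tool: returns candidate trailheads with coordinates."""
    return [
        {"name": "Trail A", "lat": 39.7, "lon": -105.2},
        {"name": "Trail B", "lat": 39.9, "lon": -105.3},
    ]


def get_forecast(lat: float, lon: float) -> dict:
    """Pretend weather tool: flags an alert for one coordinate."""
    return {"lat": lat, "lon": lon, "alert": lat > 39.8}


def plan_hike(city: str) -> list[dict]:
    # Step 1: the location tool must run first, because the weather tool
    # needs coordinates that the user never supplied.
    trails = search_trails(city)
    plan = []
    for trail in trails:
        # Step 2: each location output becomes a forecast input.
        forecast = get_forecast(trail["lat"], trail["lon"])
        plan.append({"trail": trail["name"], "weather_alert": forecast["alert"]})
    return plan


plan = plan_hike("Denver")
```

Nothing in the user request names either tool; the ordering follows only from the fact that one tool's output is the other's required input.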

Key Novelty

Ecosystem-based Benchmarking via Model Context Protocol (MCP)

Leverages the standardized MCP interface to connect agents to 28 live, production-grade servers (e.g., finance, science) rather than static API mocks.
Synthesizes tasks with 'fuzzy' instructions that strip away tool names, forcing agents to perform retrieval and planning rather than just translating commands.
Introduces a multi-faceted evaluation combining rule-based execution checks with a rubric-driven LLM judge that assesses planning efficiency and dependency awareness.
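For context on the standardized interface in the first point: MCP tool invocations are JSON-RPC 2.0 requests using the `tools/call` method. The sketch below builds one such request; the tool name `get_forecast` and its arguments are hypothetical, and transport and session setup are omitted.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "get_forecast", {"lat": 39.7, "lon": -105.0})
```

Because every server accepts the same message shape, an agent harness can talk to all connected servers through one code path.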

Evaluation Highlights

GPT-5 achieves the highest overall score of 0.749, demonstrating superior planning effectiveness (0.749) compared to Llama-3.1-8B-Instruct (0.141).
Strong models like o3 and GPT-5 maintain stable performance across single-server and multi-server settings, whereas smaller models like Llama-3.1-8B drop from 0.438 (single) to 0.415 (multi).
While schema understanding has converged (most models >95% valid tool naming), planning remains the key differentiator, with GPT-5 scoring 0.761 in dependency awareness vs. 0.337 for Llama-3.1-8B.

Breakthrough Assessment

8/10

Significantly advances tool-use benchmarking by moving from isolated APIs to connected ecosystems via MCP. The focus on fuzzy instructions and cross-server dependencies addresses a critical gap in agent evaluation.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) tuple (S, A, O, T, R, U, Σ) where agents interact with a set of MCP servers Σ.

Inputs: Natural language task instruction u and a set of available MCP servers Σ.

Outputs: Final answer, execution layers L, and execution trajectory.

Pipeline Flow

Task Synthesis (LLM generates dependency chains and fuzzy instructions)
Agent Execution (Multi-turn planning and tool invocation)
Evaluation (Rule-based checks + LLM Judge)

System Modules

Task Synthesizer

Analyzes tool I/O signatures to discover dependency chains and generates natural language tasks

Model or implementation: o4-mini
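The synthesizer itself is LLM-driven, so the following is only a rule-based sketch of the underlying idea: a dependency exists when one tool produces a field that another tool consumes. The `ToolSignature` type, the field-name matching rule, and the two example tools are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSignature:
    name: str
    inputs: frozenset[str]   # field names the tool consumes
    outputs: frozenset[str]  # field names the tool produces


def find_dependency_edges(tools: list[ToolSignature]) -> list[tuple[str, str, str]]:
    """Return (producer, consumer, field) triples where an output feeds an input."""
    edges = []
    for producer in tools:
        for consumer in tools:
            if producer.name == consumer.name:
                continue
            # A shared field name is treated as a possible data dependency.
            for shared in sorted(producer.outputs & consumer.inputs):
                edges.append((producer.name, consumer.name, shared))
    return edges


# Hypothetical tool signatures.
tools = [
    ToolSignature("geocode", frozenset({"address"}), frozenset({"lat", "lon"})),
    ToolSignature("forecast", frozenset({"lat", "lon"}), frozenset({"alerts"})),
]
edges = find_dependency_edges(tools)
```

Matching on field names alone would miss semantically equivalent fields with different names, which is one reason an LLM is a plausible choice for this step.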

Agent Executor

Plans and executes tools over multiple turns to solve the task

Model or implementation: Various target LLMs (e.g., GPT-4o, Llama-3)

Evaluator

Assesses performance using rules and LLM judging

Model or implementation: o4-mini (for judge)

Novel Architectural Elements

Integration of 28 live MCP servers as a unified environment for agent benchmarking
Automated fuzzy task synthesis pipeline that preserves numerical constraints while removing explicit tool references

Modeling

Base Model: Evaluated 20 models, including GPT-5, o3, Llama-3.1, and Claude 3.5 Sonnet.

Training Method: Benchmarking only (no training proposed)

Compute: Evaluation runs up to 20 rounds per task. Specific GPU requirements for inference not reported.
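A bounded multi-round loop of the kind implied by the 20-round cap might look like the sketch below. The `decide` and `call_tool` callables, the action dictionary layout, and the toy policy are all hypothetical; only the round limit comes from the summary.

```python
from typing import Callable, Optional

MAX_ROUNDS = 20  # the summary reports up to 20 rounds per task


def run_agent(
    decide: Callable[[list[dict]], dict],
    call_tool: Callable[[str, dict], dict],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[Optional[str], list[dict]]:
    """Loop until the policy returns a final answer or the round budget runs out."""
    trajectory: list[dict] = []
    for round_index in range(max_rounds):
        action = decide(trajectory)
        if action["type"] == "final":
            return action["answer"], trajectory
        result = call_tool(action["tool"], action["arguments"])
        trajectory.append({"round": round_index, "action": action, "result": result})
    return None, trajectory  # budget exhausted without a final answer


# Toy policy: call one tool, then answer with its result.
def toy_decide(trajectory: list[dict]) -> dict:
    if not trajectory:
        return {"type": "tool", "tool": "echo", "arguments": {"text": "hi"}}
    return {"type": "final", "answer": trajectory[-1]["result"]["text"]}


def toy_call_tool(name: str, arguments: dict) -> dict:
    return {"text": arguments["text"].upper()}


answer, trajectory = run_agent(toy_decide, toy_call_tool)
```

Each round costs at least one model call plus tool latency, which is why the cap directly bounds evaluation cost per task.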

Comparison to Prior Work

vs. ToolBench: MCP-Bench uses complementary tools within servers designed to work together, enabling deeper dependencies.
vs. Tau-Bench: MCP-Bench scales to 28 domains and 250 tools vs. Tau-Bench's limited scope.
vs. MCPEval: MCP-Bench introduces fuzzy instructions and multi-goal objectives rather than explicit tool specifications.
vs. Gorilla [not cited in paper]: Gorilla focuses on API retrieval fine-tuning; MCP-Bench evaluates general agentic planning and execution in a standardized protocol ecosystem.

Limitations

Reliance on o4-mini as the primary LLM judge may introduce bias despite prompt shuffling.
Live MCP servers may change or become unavailable, potentially affecting long-term reproducibility.
Computational cost is high due to multi-turn execution (up to 20 rounds) and LLM-based evaluation.

Reproducibility

Code: https://github.com/Accenture/mcp-bench

Code and data are available in the repository above. Because the benchmark runs against 28 live MCP servers, reproducibility depends on the continued stability of those external services.

📊 Experiments & Results

Evaluation Setup

Agents interact with MCP servers to solve 104 synthesized tasks (56 single-server, 48 multi-server).

Benchmarks:

MCP-Bench (Tool-use and planning) [New]

Metrics:

Valid Tool Name Rate
Schema Compliance Rate
Execution Success Rate
Task Completion Quality (LLM Judge)
Tool Usage Quality (LLM Judge)
Planning Effectiveness (LLM Judge)

Statistical methodology: Prompt shuffling (5 permutations) and score averaging used for LLM judge to reduce variance.
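The shuffling-and-averaging step can be sketched as follows, assuming that what gets permuted is the order of rubric items in the judge prompt. The function name, the rubric strings, and the deliberately position-biased `toy_judge` are hypothetical; a real run would call the judge model instead.

```python
import random
from statistics import mean
from typing import Callable


def judge_with_shuffling(
    rubric_items: list[str],
    score_fn: Callable[[list[str]], float],
    n_permutations: int = 5,
    seed: int = 0,
) -> float:
    """Score the same output under several rubric orderings and average."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = rubric_items[:]  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    return mean(scores)


seen_orders: list[tuple[str, ...]] = []


def toy_judge(items: list[str]) -> float:
    """Stand-in judge that favors whichever rubric item is listed first."""
    seen_orders.append(tuple(items))
    return 1.0 if items[0] == "task completion" else 0.0


rubric = ["task completion", "tool usage", "planning"]
averaged = judge_with_shuffling(rubric, toy_judge)
```

Averaging over orderings dampens position effects like the one built into `toy_judge`, though it does not remove biases that are independent of ordering.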

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Overall leaderboard performance shows a clear hierarchy with GPT-5 and o3 leading, particularly in planning capabilities.
MCP-Bench	Overall Score	0.428	0.749	+0.321
MCP-Bench	Planning Effectiveness (Dependency Awareness)	0.221	0.649	+0.428
MCP-Bench	Schema Compliance (%)	89.4	99.3	+9.9
Multi-server settings degrade performance for weaker models while strong models remain robust.
MCP-Bench (Multi-server)	Overall Score	0.438	0.415	-0.023
MCP-Bench	Average # Tool Calls	155.6	78.9	-76.7
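The Δ column is the "This Paper" value minus the "Baseline" value. A quick arithmetic check of the rows above, with values copied from the table:

```python
# (benchmark, metric, baseline, this_paper) copied from the table above.
rows = [
    ("MCP-Bench", "Overall Score", 0.428, 0.749),
    ("MCP-Bench", "Planning Effectiveness (Dependency Awareness)", 0.221, 0.649),
    ("MCP-Bench", "Schema Compliance", 89.4, 99.3),
    ("MCP-Bench (Multi-server)", "Overall Score", 0.438, 0.415),
    ("MCP-Bench", "Average # Tool Calls", 155.6, 78.9),
]

# Round to three decimals to hide floating-point noise in the subtraction.
deltas = [round(this_paper - baseline, 3) for _, _, baseline, this_paper in rows]

for (benchmark, metric, _, _), delta in zip(rows, deltas):
    print(f"{benchmark} | {metric}: {delta:+}")
```

Note that a negative Δ is an improvement for the tool-call row (fewer calls) but a regression for the multi-server score row.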

Main Takeaways

Basic execution fidelity (schema compliance) has largely converged, with most models scoring >95%.
The primary differentiator is long-horizon planning and cross-server orchestration; top models (GPT-5, o3) excel here while smaller models fail.
Performance on multi-server tasks drops for weaker models due to difficulties in maintaining dependency chains across distributed tools.
Efficient agents (like GPT-4o) solve tasks with significantly fewer tool calls (21.8 avg) compared to struggling agents (Llama-3.1-8B with 155.6 avg).
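The rule-based side of these takeaways (valid tool names, schema compliance) can be approximated with checks like the ones below. This is a simplified sketch: it handles only flat schemas with basic JSON types, treats undeclared parameters as violations, and uses denominators that are assumptions rather than the paper's definitions. The tool `get_forecast` and the sample calls are hypothetical.

```python
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def is_schema_compliant(arguments: dict, schema: dict) -> bool:
    """Check required fields and basic types against a flat JSON-style schema."""
    properties = schema.get("properties", {})
    if any(name not in arguments for name in schema.get("required", [])):
        return False
    for name, value in arguments.items():
        if name not in properties:
            return False  # assumption: undeclared parameters count as violations
        declared = properties[name].get("type")
        if declared is None:
            continue
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and declared != "boolean":
            return False
        if not isinstance(value, JSON_TYPES[declared]):
            return False
    return True


def rule_based_rates(calls: list[dict], schemas: dict[str, dict]) -> dict[str, float]:
    """Assumed definitions: name rate over all calls, compliance over valid names."""
    valid = [c for c in calls if c["name"] in schemas]
    compliant = [c for c in valid if is_schema_compliant(c["arguments"], schemas[c["name"]])]
    return {
        "valid_tool_name_rate": len(valid) / len(calls) if calls else 0.0,
        "schema_compliance_rate": len(compliant) / len(valid) if valid else 0.0,
    }


schemas = {
    "get_forecast": {
        "properties": {"lat": {"type": "number"}, "lon": {"type": "number"}},
        "required": ["lat", "lon"],
    }
}
calls = [
    {"name": "get_forecast", "arguments": {"lat": 39.7, "lon": -105.0}},
    {"name": "get_forecast", "arguments": {"lat": "39.7"}},  # wrong type, missing lon
    {"name": "get_weather", "arguments": {}},  # tool does not exist
]
rates = rule_based_rates(calls, schemas)
```

Checks like these are cheap and deterministic, which is why they complement rather than replace the rubric-driven judge for planning quality.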

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) agents
Familiarity with API tool calling and JSON schemas
Basic knowledge of reinforcement learning / POMDP formulation

Key Terms

MCP: Model Context Protocol, an open standard that gives AI assistants a consistent way to connect to external systems (data, tools, prompts).

Fuzzy Instructions: Task descriptions that state high-level goals without specifying tool names or execution steps, requiring the agent to infer the workflow.

Dependency Chain: A sequence of tool invocations where the output of one tool is required as the input for a subsequent tool.

POMDP: Partially Observable Markov Decision Process, a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment.

LLM-as-a-Judge: Using a strong LLM to evaluate the quality of another model's outputs based on specific rubrics.

Schema Compliance: Adherence to the formal structure (data types, required fields) defined by a tool's API specification.

Distractor Servers: Additional MCP servers provided to the agent that are irrelevant to the current task, testing the agent's ability to filter noise.
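One way to assemble a server pool with distractors is sketched below. The function name, the `n_distractors` parameter, the server names, and the sampling strategy are all hypothetical; the summary does not state how many distractors are attached or how they are chosen.

```python
import random


def build_server_pool(
    required: list[str],
    all_servers: list[str],
    n_distractors: int,
    seed: int = 0,
) -> list[str]:
    """Mix the servers a task needs with randomly sampled irrelevant ones."""
    rng = random.Random(seed)
    candidates = sorted(set(all_servers) - set(required))
    # Never request more distractors than there are candidates.
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))
    pool = list(required) + distractors
    rng.shuffle(pool)  # avoid leaking relevance through list position
    return pool


# Hypothetical server names.
all_servers = ["maps", "weather", "parks", "finance", "biology", "movies"]
pool = build_server_pool(["maps", "weather"], all_servers, n_distractors=3)
```

Shuffling the final pool matters: if required servers always came first, an agent could exploit ordering instead of reading tool descriptions.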