MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use

📝 Paper Summary

Agentic AI Tool Use Benchmarks

MCPAgentBench evaluates LLM agents on local, authentic Model Context Protocol (MCP) tasks, focusing on execution efficiency and robustness against distractors rather than just correctness.

Core Problem

Existing MCP benchmarks rely on unstable remote servers, lack granular difficulty awareness, and focus primarily on correctness, ignoring the execution efficiency (time and token cost) of agentic workflows.

Why it matters:

Dependency on remote MCP servers causes instability and poor reproducibility in benchmarks
Current evaluations fail to measure resource waste (excess tokens/time) even when tasks are completed correctly
Models need to be tested on their ability to select the right tools from large, distractor-filled lists, mimicking real-world 'needle in a haystack' scenarios

Concrete Example: A model might correctly solve a task requiring two parallel searches. However, instead of calling them concurrently (parallel), it calls them one by one (serial), doubling execution time and token usage. Existing benchmarks score this as 'correct', missing the inefficiency.
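
The wall-clock side of this example can be reproduced with a small asyncio sketch. `mock_search` is a hypothetical stand-in for a search tool, and the 50 ms latency is an arbitrary value chosen for illustration; the sketch shows only elapsed time, not token usage.

```python
import asyncio
import time

async def mock_search(query: str, latency: float = 0.05) -> str:
    """Hypothetical stand-in for one search tool call with a fixed latency."""
    await asyncio.sleep(latency)
    return f"results for {query!r}"

async def run_serial(queries):
    # One call at a time: total time is roughly the sum of the latencies.
    return [await mock_search(q) for q in queries]

async def run_parallel(queries):
    # All calls issued together: total time is roughly the slowest call.
    return list(await asyncio.gather(*(mock_search(q) for q in queries)))

def timed(coro_fn, queries):
    """Run an async strategy to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(queries))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    queries = ["weather in Paris", "weather in Tokyo"]
    serial_results, serial_s = timed(run_serial, queries)
    parallel_results, parallel_s = timed(run_parallel, queries)
    assert serial_results == parallel_results  # same answer either way
    print(f"serial: {serial_s:.3f}s, parallel: {parallel_s:.3f}s")
```

Both strategies return identical results, which is why a correctness-only score cannot tell them apart.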

Key Novelty

Efficiency-Centric Local MCP Benchmark

Reconstructs real-world MCP tools as local Python mock code to ensure deterministic, stable evaluation without remote server dependencies
Introduces 'Task Efficiency Finish Score' (TEFS) to penalize agents that solve tasks correctly but use inefficient strategies (e.g., serializing parallelizable sub-tasks)
Dynamic sandbox environment that injects distractors (unrelated tools) into the candidate list to test tool discrimination robustness
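
A local mock of the kind described in the first point might look like the sketch below. The tool name `get_weather`, its schema, the canned data, and the `call_tool` dispatcher are all invented for illustration and are not taken from the benchmark's repository.

```python
from typing import Any, Callable, Dict

# Hypothetical tool description, using the name/description/inputSchema
# fields that MCP tool definitions carry.
GET_WEATHER_SPEC: Dict[str, Any] = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Canned responses replace the remote server, so every run sees the same data.
_CANNED = {
    "paris": {"temp_c": 18, "sky": "cloudy"},
    "tokyo": {"temp_c": 24, "sky": "clear"},
}

def get_weather(city: str) -> Dict[str, Any]:
    """Deterministic mock: no network access, no clock, no randomness."""
    key = city.strip().lower()
    if key not in _CANNED:
        return {"error": f"unknown city: {city}"}
    return {"city": city, **_CANNED[key]}

MOCK_TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"get_weather": get_weather}

def call_tool(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a tool call by name, roughly as a local sandbox would."""
    if name not in MOCK_TOOLS:
        return {"error": f"unknown tool: {name}"}
    return MOCK_TOOLS[name](**arguments)
```

Because the mock has no external dependencies, repeated calls with the same arguments return the same value, which is the property the benchmark relies on for reproducible scoring.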

Architecture

The overall architecture of MCPAgentBench, including data collection, the sandbox environment, and the evaluation loop.

Evaluation Highlights

Claude Sonnet 4.5 achieves the highest efficiency score (57.7 TEFS), outperforming the next best model (glm-4.6) by 3.3 points
OpenAI models (e.g., gpt-5) score 0 on Dual Parallel tasks under the efficiency metric (TEFS) because they default to extreme serial execution despite solving the task correctly
The efficiency penalty is significant: gpt-o3 drops 28.5 points when moving from the correctness score (TFS) to the efficiency score (TEFS)

Breakthrough Assessment

8/10

Significant contribution by standardizing local MCP evaluation and exposing the 'correct but inefficient' failure mode of current SOTA models, particularly regarding parallel execution.

⚙️ Technical Details

Problem Definition

Setting: Agentic tool use evaluation using Model Context Protocol (MCP)

Inputs: Natural language task T and a list of candidate MCP tools L (containing n correct tools and K-n distractors)

Outputs: Sequence of tool invocations P (tool names and parameters)

Pipeline Flow

Task Loader (Selects task T)
Context Builder (Retrieves correct tools G + samples distractors F -> List L)
Agent Interaction (Agent receives T and L, generates calls)
Sandbox Execution (Autogen executes calls locally via mock code)
Scorer (Compares execution trace P against gold solution G)
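
The five stages above can be sketched as plain Python functions. This is a simplified illustration, not the benchmark's code: the real sandbox is Autogen-based, while the names `Call`, `Task`, `build_context`, `execute`, `score`, and `run_task` are hypothetical, and the agent is any callable that maps a prompt and a candidate list to tool calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Call:
    tool: str
    args: tuple  # sorted (key, value) pairs, so calls compare by value

@dataclass
class Task:
    prompt: str             # natural language task T
    gold: List[Call]        # reference solution G
    distractors: List[str]  # distractor tool names F

def make_call(tool: str, **kwargs) -> Call:
    return Call(tool, tuple(sorted(kwargs.items())))

def build_context(task: Task) -> List[str]:
    # Candidate list L = names of the correct tools plus the distractors.
    return sorted({c.tool for c in task.gold} | set(task.distractors))

def execute(calls: List[Call], mocks: Dict[str, Callable]) -> List[Call]:
    # Run each call against its local mock; unknown tools are dropped.
    trace = []
    for call in calls:
        if call.tool in mocks:
            mocks[call.tool](**dict(call.args))
            trace.append(call)
    return trace

def score(trace: List[Call], gold: List[Call]) -> bool:
    # Strict matching of the executed trace P against the solution G.
    return trace == gold

def run_task(task: Task, agent: Callable, mocks: Dict[str, Callable]) -> bool:
    candidates = build_context(task)          # Context Builder
    calls = agent(task.prompt, candidates)    # Agent Interaction
    trace = execute(calls, mocks)             # Sandbox Execution
    return score(trace, task.gold)            # Scorer
```

Keeping each stage a separate function mirrors the pipeline's separation of concerns: the agent never sees the solution, and the scorer never sees the prompt.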

System Modules

Task Loader (Input Processing)

Loads one of 178 curated tasks with unique solutions

Model or implementation: Deterministic logic

Context Builder (Input Processing)

Constructs the tool context window for the agent

Model or implementation: Deterministic sampling
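
One way to make distractor sampling deterministic is a seeded random generator, sketched below. The function name `build_candidates`, its parameters, and the use of a per-task seed are assumptions for illustration; the summary does not say how the benchmark achieves determinism.

```python
import random
from typing import List, Sequence

def build_candidates(gold: Sequence[str], pool: Sequence[str],
                     k: int, seed: int) -> List[str]:
    """Return k tool names: every gold tool plus seeded-random distractors."""
    if k < len(gold):
        raise ValueError("k must be at least the number of gold tools")
    rng = random.Random(seed)  # private generator: same seed, same list
    # Sort the pool first so the result does not depend on set ordering.
    distractor_pool = sorted(set(pool) - set(gold))
    n_distractors = k - len(gold)
    if n_distractors > len(distractor_pool):
        raise ValueError("not enough distractors in the pool")
    candidates = list(gold) + rng.sample(distractor_pool, n_distractors)
    rng.shuffle(candidates)  # avoid gold tools always appearing first
    return candidates
```

Shuffling matters because a fixed position for the correct tools would let an agent succeed without discriminating between tools at all.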

Agent (LLM under test)

Plans and generates tool calls

Model or implementation: Various (Claude, GPT-4, etc.)

Automated Evaluation Sandbox

Executes tool calls locally and records trace

Model or implementation: Autogen-based sandbox

Novel Architectural Elements

Local MCP Reconstruction: Conversion of 20,000+ remote MCP tool definitions into local executable Python mocks to remove network dependencies
Efficiency-Aware Scoring Logic: Pipeline strictly enforces parallel vs. serial execution order in the evaluation phase (TEFS)
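
The difference between a correctness check and an efficiency-aware check can be sketched as follows. The trace representation (a list of steps, each step holding the calls issued together in one agent turn) and both function names are assumptions; the paper's actual scoring rule may differ, and this sketch returns a per-task pass/fail rather than an aggregate score.

```python
from collections import Counter
from typing import List, Tuple

Call = Tuple[str, str]  # (tool name, canonical JSON of the arguments)
Step = List[Call]       # calls issued together in one agent turn
Trace = List[Step]

def flatten(trace: Trace) -> Counter:
    """Multiset of all calls in a trace, with step grouping discarded."""
    return Counter(call for step in trace for call in step)

def finishes(trace: Trace, gold: Trace) -> bool:
    """TFS-style check: same calls and parameters, grouping ignored."""
    return flatten(trace) == flatten(gold)

def finishes_efficiently(trace: Trace, gold: Trace) -> bool:
    """TEFS-style check: same calls and the same parallel/serial grouping."""
    if len(trace) != len(gold):
        return False
    return all(Counter(t) == Counter(g) for t, g in zip(trace, gold))

if __name__ == "__main__":
    a = ("search", '{"q": "a"}')
    b = ("search", '{"q": "b"}')
    gold = [[a, b]]                # dual parallel: both calls in one step
    serial_trace = [[a], [b]]      # same calls, issued one per turn
    print(finishes(serial_trace, gold))              # True
    print(finishes_efficiently(serial_trace, gold))  # False
```

Under these assumptions, a model that always serializes passes the first check and fails the second on every dual parallel task, which matches the pattern reported for the OpenAI models.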

Modeling

Base Model: Benchmark framework (evaluates external models like Claude Sonnet 4.5, GPT-4o, etc.)

Comparison to Prior Work

vs. MCP-Universe: Uses local mocks for stability vs. live remote servers
vs. MCPToolBench++: Focuses on execution efficiency (TEFS) and parallelization vs. error taxonomy and scale
vs. ToolBench: Aligned with standardized Model Context Protocol (MCP) vs. heterogeneous APIs
vs. API-Bank [not cited in paper]: Emphasizes parallel/serial logic correctness vs. just dialogue/retrieval correctness

Limitations

Benchmark size is relatively small (178 tasks) compared to training sets, though manually curated for quality.
Time efficiency metrics are subject to API latency variations despite using official endpoints.
Requires unique solutions for strict matching, which may penalize valid but alternative tool usage strategies not foreseen by annotators.
Mock code generation for tools relies on GPT-4o and manual review, so the mocks may diverge subtly from real API behavior.

Reproducibility

Code: https://github.com/MCPAgentBench/MCPAgentBench

The code and dataset are publicly available. The benchmark includes 178 high-quality, manually curated task instances and local mock implementations of MCP tools. Running it requires API keys for the models being evaluated.

📊 Experiments & Results

Evaluation Setup

The sandbox environment uses Autogen to manage agent-tool interaction. Each agent is given a list of 20 candidate tools.

Benchmarks:

MCPAgentBench (Tool Use / Agentic Planning) [New]

Metrics:

Task Finish Score (TFS)
Task Efficiency Finish Score (TEFS)
Time Efficiency (Score/minute)
Token Efficiency (Score/1k tokens)

Statistical methodology: Not explicitly reported in the paper
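
The two efficiency metrics in the list above are ratios, and the arithmetic can be sketched as below. The function names are hypothetical, and whether the paper normalizes per task or over the whole run is not stated here, so treat this as one plausible reading of "Score/minute" and "Score/1k tokens".

```python
def time_efficiency(score: float, elapsed_seconds: float) -> float:
    """Score earned per minute of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return score / (elapsed_seconds / 60.0)

def token_efficiency(score: float, total_tokens: int) -> float:
    """Score earned per 1,000 tokens consumed."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return score / (total_tokens / 1000.0)

if __name__ == "__main__":
    # Invented numbers: a score of 50 earned in 2 minutes using 2,000 tokens.
    print(time_efficiency(50.0, 120.0))   # 25.0
    print(token_efficiency(50.0, 2000))   # 25.0
```

Under this reading, two models with the same score are separated by how much time and how many tokens they spent reaching it.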

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Overall performance comparison showing the gap between simple task completion (TFS) and efficient execution (TEFS).
MCPAgentBench	TFS	48.1	71.6	+23.5
MCPAgentBench	TEFS	33.5	57.7	+24.2
MCPAgentBench	TEFS	57.7	39.4	-18.3
Specific analysis of parallel task performance reveals extreme strategic differences between models.
MCPAgentBench (Dual Parallel)	TEFS	100.0	0.0	-100.0
Efficiency metrics regarding token consumption and time.
MCPAgentBench	Token Efficiency	lowest	highest	positive

Experiment Figures

Comparison of TFS (Task Finish Score) and TEFS (Task Efficiency Finish Score) across 11 models.

Impact of model size and tool count on TEFS.

Main Takeaways

A major gap exists between correctness (TFS) and efficiency (TEFS): models often solve tasks but use wasteful serial strategies instead of parallel ones.
OpenAI models exhibit an 'extreme serial' bias, resulting in 0 scores on parallel tasks under efficiency metrics.
Claude Sonnet 4.5 adopts an aggressive parallel strategy, boosting its efficiency score but sometimes misapplying parallelism to serial tasks.
TEFS scores generally correlate with model size, but performance decreases as the number of distractor tools increases.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM Agents and Tool Use
Familiarity with Model Context Protocol (MCP)
Basic knowledge of sandbox environments

Key Terms

MCP: Model Context Protocol—a standard defining how AI agents discover and invoke external tools

TEFS: Task Efficiency Finish Score—a metric measuring if a task is both completed correctly AND executed with the optimal parallel/serial logic

TFS: Task Finish Score—a metric measuring only if the final tool calls and parameters match the solution, ignoring execution order/efficiency

Distractor Tools: Irrelevant or confusing tools added to the candidate list to test the agent's ability to filter and select the correct tool

Dual Parallel Invocation: A task type where two tools should be called simultaneously/independently rather than sequentially

Dual Serial Invocation: A task type where tool calls must happen in a specific sequence (output of tool A is input for tool B)

Autogen: A framework for enabling next-generation LLM applications with multiple agents that can converse with each other