MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

📝 Paper Summary

Multi-call tool use with flexible plan · Benchmark datasets

MCP-Atlas evaluates LLM agents on realistic, multi-step tasks using live Model Context Protocol servers via a claims-based rubric that scores factual content rather than rigid execution paths.

Core Problem

Existing tool-use benchmarks rely on mock servers, simplistic workflows, or subjective LLM-as-a-judge scoring, failing to capture the complexity of real-world discovery, parameterization, and error recovery.

Why it matters:

Real deployments require agents to orchestrate tools across multiple servers and handle rate limits or authentic errors, which mock servers mask
Subjective or trajectory-based scoring penalizes valid alternative solutions, making it hard to reliably measure progress
Current benchmarks often lack 'unknown tool' friction, exposing only correct tools and missing the critical challenge of discovery among distractors

Concrete Example: A task might require integrating financial APIs with news retrieval. An agent must discover the correct tools from a set including distractors (e.g., distinguishing 'maps_distance_matrix' from 'maps_geocode'), handle parameter errors, and synthesize a final answer grounded in those outputs—complexities missed by static Q&A.

Key Novelty

Claims-Based Evaluation on Real MCP Servers

Utilizes 36 real, containerized Model Context Protocol (MCP) servers (not mocks) to test actual API interaction and error handling
Evaluates success via a 'claims list'—a set of atomic, verifiable facts the final answer must contain—allowing for partial credit and trajectory independence
Systematically includes 5-10 plausible 'distractor' tools per task to rigorously test tool discovery and selection capabilities
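The distractor setup described above can be illustrated with a minimal sketch. The function name `build_exposed_tools`, the tool pool, and the uniform random sampling are assumptions made for illustration; the benchmark selects plausible distractors, and its actual selection procedure is not described here.

```python
import random

def build_exposed_tools(target_tools, tool_pool, rng, min_distractors=5, max_distractors=10):
    """Return (exposed, distractors): the shuffled tool list an agent sees for one task.

    Assumption: distractors are drawn uniformly at random from the pool.
    A real harness would prefer tools that look plausible for the task.
    """
    # Candidates are all pool tools that are not required by the task.
    candidates = sorted(set(tool_pool) - set(target_tools))
    k = min(rng.randint(min_distractors, max_distractors), len(candidates))
    distractors = rng.sample(candidates, k)
    exposed = list(target_tools) + distractors
    rng.shuffle(exposed)  # avoid leaking the answer through tool ordering
    return exposed, distractors

# Hypothetical pool; only the two maps_* names appear in the summary above.
pool = [f"tool_{i}" for i in range(30)] + ["maps_geocode", "maps_distance_matrix"]
exposed, distractors = build_exposed_tools(["maps_distance_matrix"], pool, random.Random(0))
print(len(exposed), len(distractors))
```

Shuffling the combined list matters: if target tools always came first, an agent could exploit position instead of reading tool descriptions.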

Evaluation Highlights

Top frontier models achieve pass rates >50% on the full 1,000-task benchmark
Next-best models lag significantly, scoring in the 20-40% range, indicating a wide spread in tool-use competency across models
Automated claims-based scoring achieves 78% agreement with human judges, validating the rubric's reliability

Breakthrough Assessment

9/10

Addresses a critical gap in agentic evaluation by moving away from mocks to real servers and solving the scoring objectivity problem with claims-based verification. High practical utility.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Large Language Model (LLM) agents on single-turn, multi-step tool-use tasks

Inputs: A natural-language prompt and a set of exposed tools (target tools plus distractors)

Outputs: Sequence of tool calls and a Final Answer synthesizing the results

Pipeline Flow

Environment Setup (Containerized MCP Servers)
Task Sampling (Prompt + Tools + Distractors)
Agent Execution (Tool Calls & Responses)
Evaluation (Claims Verification)
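The task-sampling and agent-execution stages of this flow might be sketched as follows. This is a toy loop under stated assumptions: `Task`, `RunResult`, `run_task`, the action dictionary format, and the in-process lambda "tools" are all hypothetical stand-ins, and no real MCP server or MCP client library is involved. Claims verification is a separate step that would consume `final_answer`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    prompt: str
    exposed_tools: Dict[str, Callable[..., str]]  # target + distractor tools
    claims: List[str]                             # facts the final answer must contain

@dataclass
class RunResult:
    tool_calls: List[str] = field(default_factory=list)
    final_answer: str = ""

def run_task(task: Task, agent: Callable, max_steps: int = 10) -> RunResult:
    """Drive the agent until it returns a final answer or runs out of steps."""
    result = RunResult()
    observations = []
    for _ in range(max_steps):
        action = agent(task.prompt, sorted(task.exposed_tools), observations)
        if action["type"] == "final":
            result.final_answer = action["answer"]
            break
        name = action["tool"]
        result.tool_calls.append(name)
        try:
            output = task.exposed_tools[name](**action.get("args", {}))
        except Exception as exc:
            # Surface tool errors to the agent so it can attempt recovery.
            output = f"ERROR: {exc}"
        observations.append((name, output))
    return result

def toy_agent(prompt, tool_names, observations):
    """Scripted agent: one tool call, then a final answer built from its output."""
    if not observations:
        return {"type": "tool", "tool": "get_price", "args": {"ticker": "ACME"}}
    return {"type": "final", "answer": f"ACME trades at {observations[-1][1]}."}

task = Task(
    prompt="What is the current price of ACME?",
    exposed_tools={
        "get_price": lambda ticker: "42.00 USD",  # target tool (stub)
        "get_news": lambda query: "no news",      # distractor tool (stub)
    },
    claims=["ACME trades at 42.00 USD"],
)
result = run_task(task, toy_agent)
print(result.tool_calls, result.final_answer)
```

Returning errors as observations rather than raising keeps the loop alive, which is what allows error recovery to be measured at all.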

System Modules

MCP Host Environment

Host 36 real MCP servers in isolated containers with allow-listed egress

Model or implementation: Real Implementation (e.g., Google Maps, Slack, Linear)

Evaluation Harness (Evaluation)

Manage the interaction loop, inject distractors, and capture the agent's final answer

Model or implementation: Containerized Harness

Claims Verifier (Evaluation)

Score the final answer by checking for the presence of required factual claims

Model or implementation: Gemini 2.5 Pro (Judge)

Novel Architectural Elements

Integration of live MCP servers into a scalable evaluation harness, replacing static mocks
Claims-based verification pipeline that decomposes answers into atomic facts for granular partial-credit scoring

Modeling

Base Model: Gemini 2.5 Pro (used as Judge)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MCP-Universe: Scales to 1,000 tasks and uses claims-based scoring rather than execution-only verification
vs. MCPEval: Uses 100% real servers instead of synthetic mocks
vs. MCP-Bench: Uses atomic claims verification instead of holistic judging to reduce subjectivity
vs. ToolBench [not cited in paper]: Focuses specifically on the MCP standard and real-time server interaction rather than static REST API calls

Limitations

Dependency on the judge model (Gemini 2.5 Pro) for accurate claims verification
The benchmark requires a containerized environment to run real servers, increasing computational overhead compared to static text benchmarks
Evaluating on live servers may introduce non-deterministic behavior (e.g., external API outages or updates) impacting reproducibility over time

Reproducibility

The paper mentions releasing a 500-task public subset, the task schema, and the containerized harness to facilitate reproduction. 500 tasks are held out for leaderboard integrity. No specific GitHub URL is provided in the text.

📊 Experiments & Results

Evaluation Setup

Agentic tool-use tasks interacting with real MCP servers

Benchmarks:

MCP-Atlas (Multi-step tool orchestration) [New]

Metrics:

Pass Rate (Claims Coverage > 0.75)
Coverage Score (Average fraction of claims fulfilled)
Statistical methodology: Not explicitly reported in the paper
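The two metrics above can be made concrete with a short sketch. The threshold of 0.75 and the strict inequality follow the metric definition as listed here; the function names and the substring-based judge are illustrative assumptions standing in for the LLM judge (Gemini 2.5 Pro), which would assess each claim semantically rather than by string matching.

```python
from typing import Callable, List

PASS_THRESHOLD = 0.75  # from the metric definition: Claims Coverage > 0.75

def coverage_score(answer: str, claims: List[str], judge: Callable[[str, str], bool]) -> float:
    """Fraction of claims that the judge finds fulfilled by the answer."""
    if not claims:
        raise ValueError("a task needs at least one claim")
    fulfilled = sum(1 for claim in claims if judge(answer, claim))
    return fulfilled / len(claims)

def pass_rate(scores: List[float], threshold: float = PASS_THRESHOLD) -> float:
    """Share of tasks whose coverage strictly exceeds the threshold."""
    return sum(score > threshold for score in scores) / len(scores)

def substring_judge(answer: str, claim: str) -> bool:
    """Stand-in judge (assumption): case-insensitive substring match."""
    return claim.lower() in answer.lower()

# Hypothetical task with four atomic claims; the answer covers three of them.
claims = ["42.00 USD", "NASDAQ", "up 3%", "earnings beat"]
answer = "ACME closed at 42.00 USD on NASDAQ, up 3% on the day."
score = coverage_score(answer, claims, substring_judge)
print(score, pass_rate([score, 1.0, 0.5]))
```

Because scoring depends only on the final answer and the claims list, two agents that reach the same facts through different tool sequences receive the same score.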

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MCP-Atlas Subset	Agreement with Human Majority (%)	100	78	-22
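An agreement figure of this kind is typically computed by comparing each automated verdict with the majority vote of human annotators. The sketch below assumes binary verdicts and an odd number of annotators per item; the function names and the data are hypothetical, and whether the paper measures agreement per claim or per task is not stated in this summary.

```python
from collections import Counter
from typing import List, Sequence

def majority_label(votes: Sequence[bool]) -> bool:
    """Return the most common vote; reject ties rather than break them arbitrarily."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        raise ValueError("tie between annotators; use an odd number of votes")
    return counts[0][0]

def agreement_rate(judge_labels: List[bool], human_votes: List[Sequence[bool]]) -> float:
    """Fraction of items where the automated judge matches the human majority."""
    if len(judge_labels) != len(human_votes):
        raise ValueError("one set of human votes is required per judged item")
    matches = sum(j == majority_label(v) for j, v in zip(judge_labels, human_votes))
    return matches / len(judge_labels)

# Hypothetical data: four items, three annotators each.
judge = [True, False, True, True]
humans = [
    [True, True, False],   # majority True  -> judge agrees
    [True, True, True],    # majority True  -> judge disagrees
    [True, False, True],   # majority True  -> judge agrees
    [False, True, True],   # majority True  -> judge agrees
]
rate = agreement_rate(judge, humans)
print(rate)
```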

Main Takeaways

Current frontier models achieve just over 50% pass rate, indicating substantial headroom for improvement in complex tool orchestration.
A significant performance gap exists between the top models and the next-best models (20-40% pass rate), showing that agentic capability varies widely across models.
Primary failure modes are 'Tool Usage' (incorrect server selection/parameters) and 'Task Understanding' (premature stopping), validating the difficulty of the 'unknown-tools' setting.
Approximately 1/3 of tasks require conditional branching, and the vast majority require cross-server orchestration, confirming the benchmark's complexity.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tool-use/function-calling
Basic knowledge of API client/server architectures

Key Terms

MCP: Model Context Protocol—a standard interface enabling LLMs to discover and invoke external tools and resources in a uniform way

Claims-based rubric: An evaluation method where the correctness of a response is measured by the presence of specific, independently verifiable factual claims derived from ground truth

Distractors: Plausible but incorrect tools included in the agent's context to test its ability to identify the right tool for the job

Reference Trajectory: The minimal sequence of tool calls required to solve a task, used here for diagnostics rather than strict pass/fail scoring

Gemini 2.5 Pro: The specific LLM used as the automated judge to verify claims in this benchmark