
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu
Scale AI
arXiv (2026)
Agent Benchmark

📝 Paper Summary

Multi-call tool use with flexible plan
Benchmark datasets
MCP-Atlas evaluates LLM agents on realistic, multi-step tasks using live Model Context Protocol servers via a claims-based rubric that scores factual content rather than rigid execution paths.
Core Problem
Existing tool-use benchmarks rely on mock servers, simplistic workflows, or subjective LLM-as-a-judge scoring, failing to capture the complexity of real-world discovery, parameterization, and error recovery.
Why it matters:
  • Real deployments require agents to orchestrate tools across multiple servers and handle rate limits or authentic errors, which mock servers mask
  • Subjective or trajectory-based scoring penalizes valid alternative solutions, making it hard to reliably measure progress
  • Current benchmarks often lack 'unknown tool' friction, exposing only correct tools and missing the critical challenge of discovery among distractors
Concrete Example: A task might require integrating financial APIs with news retrieval. An agent must discover the correct tools from a set including distractors (e.g., distinguishing 'maps_distance_matrix' from 'maps_geocode'), handle parameter errors, and synthesize a final answer grounded in those outputs—complexities missed by static Q&A.
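The discovery setup in this example can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function `build_tool_list`, the seed handling, and every tool name other than 'maps_distance_matrix' and 'maps_geocode' are hypothetical.

```python
import random


def build_tool_list(required, distractor_pool, n_distractors, seed=0):
    """Mix the tools a task needs with plausible distractors.

    Hypothetical helper: the summary does not describe how the
    benchmark samples or orders the tools it exposes to the agent.
    """
    rng = random.Random(seed)
    # Never count a required tool as a distractor.
    candidates = [t for t in distractor_pool if t not in required]
    if n_distractors > len(candidates):
        raise ValueError("not enough distractors in the pool")
    exposed = list(required) + rng.sample(candidates, n_distractors)
    # Shuffle so position gives no hint about which tools are needed.
    rng.shuffle(exposed)
    return exposed


# Only the first two pool entries appear in the summary; the rest are invented.
pool = ["maps_geocode", "maps_distance_matrix", "weather_lookup", "news_search"]
exposed = build_tool_list(["maps_distance_matrix"], pool, n_distractors=2)
print(exposed)
```

The agent sees only the shuffled list, so it has to pick the right tool from its name and description rather than from its position.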
Key Novelty
Claims-Based Evaluation on Real MCP Servers
  • Utilizes 36 real, containerized Model Context Protocol (MCP) servers (not mocks) to test actual API interaction and error handling
  • Evaluates success via a 'claims list'—a set of atomic, verifiable facts the final answer must contain—allowing for partial credit and trajectory independence
  • Systematically includes 5-10 plausible 'distractor' tools per task to rigorously test tool discovery and selection capabilities
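A claims-based score of this kind could be computed roughly as below. This is a sketch under stated assumptions: the per-claim verdicts are taken as given (the summary does not say how each claim is judged), and the names `ClaimResult`, `coverage_score`, `passes`, and the 0.75 pass threshold are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ClaimResult:
    claim: str        # one atomic, verifiable fact
    satisfied: bool   # verdict: does the final answer contain it?


def coverage_score(results):
    """Fraction of claims the final answer satisfies (partial credit)."""
    if not results:
        raise ValueError("claims list must not be empty")
    return sum(r.satisfied for r in results) / len(results)


def passes(results, threshold=0.75):
    """Pass/fail from coverage; the threshold value is an assumption."""
    return coverage_score(results) >= threshold


# Invented claims, for illustration only.
results = [
    ClaimResult("Names the correct company", True),
    ClaimResult("Reports the closing price", True),
    ClaimResult("Cites the related news headline", False),
]
print(round(coverage_score(results), 2))  # 0.67
print(passes(results))                    # False
```

Because the score depends only on the final answer, two agents that call different tools in a different order receive the same credit when they report the same facts.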
Evaluation Highlights
  • Top frontier models achieve pass rates >50% on the full 1,000-task benchmark
  • Next-best models lag significantly, scoring in the 20-40% range, indicating high variance in tool-use competency
  • Automated claims-based scoring achieves 78% agreement with human judges, validating the rubric's reliability
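An agreement figure like the one above can be read as simple percent agreement between automated and human verdicts. The summary does not state how agreement was actually computed, so the sketch below is only one plausible reading; `percent_agreement` and the sample verdicts are hypothetical.

```python
def percent_agreement(auto_verdicts, human_verdicts):
    """Share of items on which the automated scorer and the human judge agree."""
    if len(auto_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not auto_verdicts:
        raise ValueError("verdict lists must not be empty")
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return matches / len(auto_verdicts)


# Invented verdicts for four items; the two judges disagree on the second.
auto = [True, True, False, True]
human = [True, False, False, True]
print(percent_agreement(auto, human))  # 0.75
```

Raw percent agreement does not correct for agreement expected by chance, so a chance-corrected statistic such as Cohen's kappa would give a stricter reading of the same data.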
Breakthrough Assessment
9/10
Addresses a critical gap in agentic evaluation by moving away from mocks to real servers and solving the scoring objectivity problem with claims-based verification. High practical utility.