
MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use

Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin
SenseTime Research
arXiv (2025)
Agent Benchmark Reasoning

📝 Paper Summary

Benchmark Agentic Tool Use
MCPVerse evaluates LLM agents using over 550 executable real-world tools via the Model Context Protocol, testing their ability to navigate vast action spaces and solve time-sensitive tasks.
Core Problem
Existing tool-use benchmarks rely on artificial/mock tools or constrain action spaces to small subsets due to context limits, failing to test if agents can navigate complex, real-world environments.
Why it matters:
  • Mock tools (e.g., simplified weather APIs) allow models to memorize superficial patterns rather than learn the robust planning that production systems require
  • Constrained action spaces (mounting only ~10 tools per query) prevent assessing an agent's ability to explore and exploit vast solution spaces effectively
  • Lack of real-time execution in prior benchmarks limits evaluation to 'correct tool name prediction' rather than functional success
Concrete Example: In standard benchmarks, a model might simply select 'WeatherAPI' from a list of 5 options. In MCPVerse, the model must choose among 552 tools (about 147k tokens of schemas) and may need to combine, for instance, a 'FlightRadar' tool with a 'Google Maps' tool to answer a complex travel query for which the correct path is not obvious.
Key Novelty
Massive-Scale Real-World Tool Benchmark via MCP
  • Integrates 65 MCP servers providing 552 unique executable tools, whose schemas occupy over 147k tokens of context, far more than typical benchmarks
  • Uses the Model Context Protocol (MCP) as a standardized interface to connect LLMs to diverse real-world systems like file systems, databases, and flight trackers
  • Employs 'Max-Scale Mode' where all 552 tools are loaded simultaneously into the context, forcing the agent to discern relevant tools from hundreds of distractors
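As an illustration of what loading every tool at once involves, the following is a minimal Python sketch, not the benchmark's actual harness. It merges per-server tool lists into one flat list and estimates how much context the schemas consume. The `name`, `description`, and `inputSchema` fields follow the MCP tool definition; the `server.tool` naming scheme, the two example servers, and the 4-characters-per-token heuristic are assumptions made for illustration only.

```python
import json


def flatten_tools(servers):
    """Merge per-server tool lists into one flat list.

    Tool names are prefixed with the server name (a hypothetical convention)
    so that identically named tools from different servers do not collide.
    """
    merged = []
    for server_name, tools in servers.items():
        for tool in tools:
            merged.append({
                "name": f"{server_name}.{tool['name']}",
                "description": tool["description"],
                "inputSchema": tool["inputSchema"],
            })
    return merged


def approx_tokens(tools, chars_per_token=4):
    """Rough token estimate: serialized length divided by an assumed ratio."""
    return len(json.dumps(tools)) // chars_per_token


# Two made-up servers standing in for the benchmark's 65.
servers = {
    "filesystem": [{
        "name": "read_file",
        "description": "Read a file from disk",
        "inputSchema": {"type": "object",
                        "properties": {"path": {"type": "string"}}},
    }],
    "flights": [{
        "name": "track_flight",
        "description": "Look up the live position of a flight",
        "inputSchema": {"type": "object",
                        "properties": {"flight_no": {"type": "string"}}},
    }],
}

all_tools = flatten_tools(servers)
print([t["name"] for t in all_tools], approx_tokens(all_tools))
```

With hundreds of tools, the flat list itself becomes the dominant part of the prompt, which is why the schema footprint matters as much as the tool count.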
Evaluation Highlights
  • Claude-4-Sonnet achieves a success rate of only 44.2% in Max-Scale mode (all 65 MCPs loaded), indicating significant room for improvement
  • Agentic models like Claude-4-Sonnet and GLM-4.5 perform better in Standard Mode (32 MCPs) than in Oracle Mode (minimal set), suggesting that larger tool spaces let agents find emergent 'hacking' solutions that the minimal set does not offer
  • Many SOTA models fail at scale: DeepSeek-V3 is limited by 64k context, while GPT-4o and Gemini-2.5-Pro hit tool-count limits (128 and 512 tools respectively)
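The scale failures listed above come down to two separate limits, which can be checked with simple arithmetic. The sketch below is a hypothetical helper, not part of MCPVerse: it encodes only the limits quoted in this summary (64k context for DeepSeek-V3, 128 tools for GPT-4o, 512 tools for Gemini-2.5-Pro) and tests them against the Max-Scale load of 552 tools and roughly 147k schema tokens. Limits not stated in the summary are left as `None` and treated as unknown, and the round figures 64,000 and 147,000 are approximations of "64k" and "over 147k".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelLimits:
    name: str
    context_tokens: Optional[int] = None  # None = not stated in the summary
    max_tools: Optional[int] = None       # None = not stated in the summary


def can_run(limits, n_tools, schema_tokens):
    """Return (ok, reason) for loading n_tools whose schemas need schema_tokens."""
    if limits.max_tools is not None and n_tools > limits.max_tools:
        return False, "tool-count limit"
    if limits.context_tokens is not None and schema_tokens > limits.context_tokens:
        return False, "context limit"
    return True, "ok"


MAX_SCALE_TOOLS = 552
MAX_SCALE_SCHEMA_TOKENS = 147_000  # approximate, from the "over 147k" figure

models = [
    ModelLimits("DeepSeek-V3", context_tokens=64_000),
    ModelLimits("GPT-4o", max_tools=128),
    ModelLimits("Gemini-2.5-Pro", max_tools=512),
]

results = {
    m.name: can_run(m, MAX_SCALE_TOOLS, MAX_SCALE_SCHEMA_TOKENS) for m in models
}
for name, (ok, reason) in results.items():
    print(f"{name}: {'runs' if ok else 'blocked'} ({reason})")
```

Under these figures all three models are blocked in Max-Scale mode, each for the reason the summary gives: 147k exceeds a 64k context, and 552 exceeds both 128 and 512.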
Breakthrough Assessment
9/10
Sets a new standard for tool-use benchmarking by moving away from mock APIs to hundreds of real executable tools. The 'Max-Scale' setting pushes the boundaries of context windows and agentic reasoning.