Evaluation Setup
Agentic tool-use tasks interacting with real MCP servers
Benchmarks:
- MCP-Atlas (Multi-turn tool orchestration) [New]
Metrics:
- Pass Rate (share of tasks with Claims Coverage > 0.75)
- Coverage Score (Average fraction of claims fulfilled)
- Statistical methodology: Not explicitly reported in the paper
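The two metrics above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes each task is represented as a list of booleans (one per claim, true when the claim is judged fulfilled), and the names `coverage`, `summarize`, and `PASS_THRESHOLD` are hypothetical. The strict `>` comparison follows the metric as stated here.

```python
from typing import List, Tuple

# Threshold taken from the Pass Rate definition above.
PASS_THRESHOLD = 0.75


def coverage(claims_fulfilled: List[bool]) -> float:
    """Fraction of one task's claims that were judged fulfilled."""
    if not claims_fulfilled:
        raise ValueError("a task needs at least one claim")
    return sum(claims_fulfilled) / len(claims_fulfilled)


def summarize(tasks: List[List[bool]]) -> Tuple[float, float]:
    """Return (coverage_score, pass_rate) over a set of tasks.

    coverage_score: mean per-task coverage.
    pass_rate: share of tasks whose coverage exceeds PASS_THRESHOLD.
    """
    if not tasks:
        raise ValueError("need at least one task")
    scores = [coverage(task) for task in tasks]
    coverage_score = sum(scores) / len(scores)
    pass_rate = sum(score > PASS_THRESHOLD for score in scores) / len(scores)
    return coverage_score, pass_rate


if __name__ == "__main__":
    # Hypothetical judgments for three tasks (not data from the paper).
    example = [
        [True, True, True, True],   # coverage 1.0
        [True, True, True, False],  # coverage 0.75, not above the threshold
        [True, False],              # coverage 0.5
    ]
    print(summarize(example))
```

Note that under a strict inequality a task with coverage of exactly 0.75 does not pass, so the two metrics can diverge: a model can have a high Coverage Score and still a low Pass Rate if many tasks land at or just below the threshold.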
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MCP-Atlas Subset | Agreement with Human Majority | 100 | 78 | -22 |
Main Takeaways
- Current frontier models achieve just over 50% pass rate, indicating substantial headroom for improvement in complex tool orchestration.
- A significant performance gap exists between top models and next-best models (20-40% pass rate), highlighting high variance in agentic capabilities.
- Primary failure modes are 'Tool Usage' (incorrect server selection/parameters) and 'Task Understanding' (premature stopping), validating the difficulty of the 'unknown-tools' setting.
- Approximately 1/3 of tasks require conditional branching, and the vast majority require cross-server orchestration, confirming the benchmark's complexity.