
MASEval: Extending Multi-Agent Evaluation from Models to Systems

Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri
Parameter Lab, University of Oxford, University of Tübingen, NAVER AI Lab, KAIST
arXiv (2026)
Agent Benchmark

📝 Paper Summary

Multi-agent evaluation, Agent frameworks
MASEval provides framework-agnostic infrastructure for evaluating multi-agent systems as complete units, and shows that orchestration implementation choices can affect performance about as much as the capabilities of the underlying model.
Core Problem
Existing benchmarks are model-centric: they fix the agent scaffold and ignore how system-level decisions (topology, orchestration, error handling) affect performance, which leaves practitioners without guidance on framework choice.
Why it matters:
  • Practitioners lack data-driven guidance on which agent framework (e.g., LangGraph vs. smolagents) best suits their use case
  • Researchers cannot easily isolate the impact of design decisions like communication topology versus model capability
  • Benchmark consumers face fragmented interfaces requiring significant boilerplate to test agents across multiple datasets
Concrete Example: A user building a travel agent might find that GPT-5-mini fails on the MACS benchmark under smolagents, because the framework forces a tool call at every step and the model loops endlessly on clarification questions. The same model succeeds with LlamaIndex.
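The failure mode in this example can be illustrated with a toy loop. This is only a sketch under stated assumptions: `run_episode` and `force_tool_call` are hypothetical names, the "model" is a fixed policy that always prefers a plain-text reply, and the code does not reproduce the actual control flow of smolagents or LlamaIndex.

```python
def run_episode(force_tool_call: bool, max_steps: int = 5) -> str:
    """Toy episode: the model always prefers to reply to the user in plain text.

    force_tool_call is a hypothetical scaffold setting (an assumption for
    illustration): when True, a text-only turn is not accepted as final.
    """
    for _ in range(max_steps):
        if not force_tool_call:
            # The scaffold accepts a text-only turn, so the episode ends.
            return "replied"
        # The scaffold demands a tool call, so the model issues another
        # clarification request through a tool and the loop continues.
    return "step_limit_reached"


print(run_episode(force_tool_call=False))
print(run_episode(force_tool_call=True))
```

The model policy is identical in both calls; only the scaffold rule differs, which is the kind of system-level effect the summary describes.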
Key Novelty
System-Level Evaluation Infrastructure (MASEval)
  • Decouples the system under test from the benchmark harness using adapters, allowing any agent framework to be evaluated on any benchmark
  • Treats the entire system (agents + framework + coordination logic) as the unit of analysis rather than just the model
  • Standardizes the evaluation lifecycle (Setup → Execute → Collect → Evaluate) to reduce boilerplate for benchmark producers and consumers
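The adapter idea and the four lifecycle stages above can be sketched as follows. This is a minimal illustration, not MASEval's actual API: `AgentSystemAdapter`, `Task`, `ToySystem`, and `evaluate` are hypothetical names, and exact-match scoring is an assumed stand-in for a real benchmark metric.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class AgentSystemAdapter(Protocol):
    """Hypothetical adapter interface: any framework is wrapped behind run()."""

    def run(self, task_input: str) -> str: ...


@dataclass
class Task:
    prompt: str
    expected: str


@dataclass
class ToySystem:
    """Stand-in for a framework-specific multi-agent system (assumption)."""

    transform: Callable[[str], str]
    trace: list = field(default_factory=list)

    def run(self, task_input: str) -> str:
        output = self.transform(task_input)
        self.trace.append((task_input, output))  # keep a trace for inspection
        return output


def evaluate(system: AgentSystemAdapter, tasks: list[Task]) -> float:
    """Run Setup -> Execute -> Collect -> Evaluate and return accuracy."""
    if not tasks:
        return 0.0
    results = []                                  # Setup
    for task in tasks:
        output = system.run(task.prompt)          # Execute
        results.append((task, output))            # Collect
    correct = sum(1 for task, out in results if out == task.expected)
    return correct / len(tasks)                   # Evaluate


tasks = [Task("a", "A"), Task("b", "B")]
print(evaluate(ToySystem(str.upper), tasks))
print(evaluate(ToySystem(str.lower), tasks))
```

Because `evaluate` depends only on the `run` method, swapping the system under test does not require touching the benchmark loop, which is the decoupling the first bullet describes.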
Evaluation Highlights
  • Framework choice produces an average performance range of 12.4 percentage points (pp), comparable to the 14.2 pp range produced by model choice
  • Claude-Haiku-4.5 performance on MACS Travel swings by 30.9 pp depending solely on the framework (90.4% with smolagents vs. 59.5% with LlamaIndex)
  • Reduces interface boilerplate code by 83–91% for benchmark consumers compared to original benchmark implementations
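The percentage-point figures above can be checked with a short calculation. The sketch assumes that "range" means the maximum score minus the minimum score across frameworks for a fixed model; `range_pp` is a hypothetical helper, and only the two Claude-Haiku-4.5 scores quoted above are used.

```python
def range_pp(scores: dict[str, float]) -> float:
    """Spread in percentage points between the best and worst score."""
    return round(max(scores.values()) - min(scores.values()), 1)


# Claude-Haiku-4.5 on MACS Travel, scores as quoted in the summary (percent).
haiku_macs_travel = {"smolagents": 90.4, "LlamaIndex": 59.5}
print(range_pp(haiku_macs_travel))  # 30.9
```

Under this reading, the averaged 12.4 pp figure would be the mean of such per-model ranges, though the summary does not spell out the exact averaging procedure.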
Breakthrough Assessment
9/10
MASEval fundamentally shifts the unit of analysis from models to systems, exposing a critical blind spot in current evaluation. Its infrastructure significantly lowers the barrier for rigorous cross-framework comparison.