MASEval: Extending Multi-Agent Evaluation from Models to Systems

📝 Paper Summary

Multi-agent evaluation Agent frameworks

MASEval provides a framework-agnostic infrastructure to evaluate multi-agent systems as complete units, revealing that orchestration implementation choices impact performance as significantly as the underlying model capabilities.

Core Problem

Existing benchmarks are model-centric, fixing the agent scaffold and ignoring how system-level decisions (topology, orchestration, error handling) impact performance, leaving practitioners without guidance on framework choice.

Why it matters:

Practitioners lack data-driven guidance on which agent framework (e.g., LangGraph vs. smolagents) best suits their use case
Researchers cannot easily isolate the impact of design decisions like communication topology versus model capability
Benchmark consumers face fragmented interfaces requiring significant boilerplate to test agents across multiple datasets

Concrete Example: A user implementing a travel agent might find that GPT-5-mini fails on the MACS benchmark when using smolagents because the framework forces a tool call every step, causing the model to loop endlessly on clarification questions, whereas the same model succeeds with LlamaIndex.

Key Novelty

System-Level Evaluation Infrastructure (MASEval)

Decouples the system under test from the benchmark harness using adapters, allowing any agent framework to be evaluated on any benchmark
Treats the entire system (agents + framework + coordination logic) as the unit of analysis rather than just the model
Standardizes the evaluation lifecycle (Setup → Execute → Collect → Evaluate) to reduce boilerplate for benchmark producers and consumers

Evaluation Highlights

Framework choice creates a performance range of 12.4 percentage points (pp) on average, comparable to the 14.2 pp range driven by model choice
Claude-Haiku-4.5 performance on MACS Travel swings by 30.9 pp depending solely on the framework (90.4% with smolagents vs. 59.5% with LlamaIndex)
Reduces interface boilerplate code by 83–91% for benchmark consumers compared to original benchmark implementations

Breakthrough Assessment

9/10

MASEval fundamentally shifts the unit of analysis from models to systems, exposing a critical blind spot in current evaluation. Its infrastructure significantly lowers the barrier for rigorous cross-framework comparison.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of multi-agent systems across diverse tasks, frameworks, and models

Inputs: Task definition (query, environment), Agent System (model + framework implementation)

Outputs: Performance metrics (success rate, robustness), Execution Traces (per-agent message history)

Pipeline Flow

Setup (Initialize environment, agents, evaluators)
Execute (Run agent loop with turn orchestration)
Collect (Gather traces from agents and components)
Evaluate (Compute metrics from traces)
Report (Store structured results)

System Modules

AgentAdapter

Wraps arbitrary framework agents to expose a standard interface (step, get_history) for the core engine

Model or implementation: N/A (Interface)

Environment (Core Layer)

Manages state and exposes tools, persisting across turns within a task

Model or implementation: N/A (Logic)

Evaluator (Core Layer)

Filters traces and computes metrics (performance, robustness) after execution

Model or implementation: LLM-as-a-Judge (optional)

AdaptiveTaskQueue

Selects the most informative tasks to evaluate based on current skill estimates to reduce compute

Model or implementation: Statistical Model (e.g., IRT)

Novel Architectural Elements

Universal AgentAdapter abstraction enabling cross-framework evaluation without framework-specific code in the core
Decoupled lifecycle management (Setup-Execute-Collect-Evaluate) that separates system execution from measurement logic
Multi-agent tracing registry that automatically collates message histories from independent agents

Modeling

Base Model: Varies by experiment (GPT-5-mini, Gemini-3.0-Flash, Claude-Haiku-4.5)

Comparison to Prior Work

vs. Inspect AI: MASEval supports arbitrary multi-agent topologies and per-agent tracing, whereas Inspect's solver abstraction is single-agent focused
vs. AgentBench: MASEval evaluates the full system (framework + logic), not just the model capabilities
vs. LangSmith: MASEval is open-source and framework-agnostic, avoiding vendor lock-in
+ 1 more
vs. HAL: MASEval supports multi-agent systems and pluggable logging backends, not just W&B

Limitations

Initial setup cost is higher than opinionated libraries due to the 'Bring Your Own' adapter philosophy
Thinner per-platform feature surface compared to vendor-specific tools (no turnkey dashboards out of the box)
Evaluation results are noisy due to stochastic model outputs, requiring aggregation rather than definitive pairwise ranking
Robustness focuses on reproducibility and extensibility rather than defensive coding depth found in production systems

Reproducibility

Code: https://github.com/parameterlab/MASEval

📊 Experiments & Results

Evaluation Setup

Full factorial experiment: 3 Frameworks × 3 Models × 3 Benchmarks (2 domains each)

Benchmarks:

MACS (Multi-agent coordination (Travel, Enterprise))
ConVerse (Security robustness (Defense against attackers))
MultiAgentBench (Collaboration and competition (Research, Bargaining))

Metrics:

Partial Goal Success Rate (pGSR)
Robustness (1 - ASR)
Task Completion Rate
Task Score (TS)
Statistical methodology: Reported means and standard deviations of performance ranges across frameworks and models

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Analysis of performance variability shows that the choice of framework introduces as much variance in outcomes as the choice of model.
Average across all 6 domains	Mean Range (pp)	14.2	12.4	-1.8
Average across all 6 domains	Standard Deviation (pp)	7.5	6.5	-1.0
Specific case studies reveal extreme sensitivity to framework choice for certain models.
MACS (Travel)	pGSR	59.5	90.4	+30.9
Implementation efficiency analysis demonstrating reduction in boilerplate code for benchmark consumers.
ConVerse	Interface Lines of Code	154	14	-140
tau-bench	Interface Lines of Code	98	17	-81

Main Takeaways

Framework choice is not neutral; it impacts agent system performance comparably to model choice (avg 12.4pp vs 14.2pp range).
Framework-model interactions can cause catastrophic failures, such as GPT-5-mini looping on clarification questions in smolagents due to mandatory tool-calling conventions.
MASEval significantly reduces implementation overhead for using benchmarks (35–57% total code reduction), allowing researchers to focus on agent logic rather than orchestration boilerplate.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based agentic systems (tools, planning, memory)
Familiarity with agent frameworks (LangGraph, AutoGen, LlamaIndex)
Knowledge of software testing concepts (unit of analysis, mocking, tracing)

Key Terms

MASEval: The proposed framework-agnostic evaluation library for multi-agent systems

smolagents: A minimalist, code-centric agent framework by Hugging Face

LangGraph: A graph-based agent orchestration framework by LangChain allowing explicit state management

LlamaIndex: A data framework for LLMs that supports agentic workflows and retrieval

MACS: Multi-Agent Coordination Survey—a benchmark testing multi-agent coordination on enterprise tasks

ConVerse: A benchmark measuring resistance to security attacks in agent-to-agent conversations

pGSR: Partial Goal Success Rate—a metric measuring the percentage of sub-goals successfully achieved

ASR: Attack Success Rate—the frequency with which an attacker agent successfully compromises a victim agent

GPT-5-mini: A hypothetical/future mid-tier model used in the paper's experiments

Gemini-3.0-Flash: A hypothetical/future mid-tier model used in the paper's experiments

Claude-Haiku-4.5: A hypothetical/future mid-tier model used in the paper's experiments

TS: Task Score—a metric used in MultiAgentBench to quantify bargaining outcomes

Item Response Theory: A statistical paradigm used in the paper's AdaptiveTaskQueue to estimate skill from a subset of items

Orchestration logic: The code and rules governing how multiple agents interact, pass messages, and take turns