Single-agent or Multi-agent Systems? Why Not Both?

📝 Paper Summary

Multi-agent vs. Single-agent comparison · Agentic system optimization · Hybrid agent architectures

As frontier LLMs improve, the accuracy gap between single-agent and multi-agent systems narrows while the cost gap widens, motivating a hybrid approach that dynamically routes requests between the two.

Core Problem

Multi-agent systems (MAS) incur significantly higher complexity and cost than single-agent systems (SAS), and their accuracy advantage is diminishing as frontier LLMs improve in long-context reasoning.

Why it matters:

Deploying MAS involves high engineering effort and runtime costs (latency, tokens), which may not be justifiable if accuracy gains are minimal
MAS can degrade performance due to coordination breakdowns and 'overthinking' on simple tasks
Practitioners lack guidance on navigating the accuracy-efficiency tradeoff when choosing between SAS and MAS

Concrete Example: In code generation tasks using the Self-Collab framework, 'problem analyst' and 'tester' agents may introduce unnecessary corner cases, overwhelming the 'coder' agent and causing it to fail on a task that a single agent could solve correctly.

Key Novelty

Hybrid Agent Routing and Cascading

Formalizes agent execution as a dependency graph to identify 'critical agents' that bottleneck performance
Proposes a 'confidence-guided tracing' method to attribute errors to specific agents based on confidence and output quality
Introduces 'Agent Routing' and 'Agent Cascade' paradigms to selectively offload requests between SAS and MAS, optimizing the accuracy-efficiency frontier

Architecture

Comparison of Single-Agent vs. Multi-Agent paradigms and the proposed hybrid approach.

Evaluation Highlights

Hybrid design improves accuracy by 1.1% to 12% across various agentic applications compared to pure MAS or SAS baselines
Reduces deployment costs by up to 88.1% compared to running MAS alone
SAS with Gemini-2.0-Flash matches or beats MAS on simple tasks, while MAS consumes 4–220× more input tokens than SAS

Breakthrough Assessment

8/10

Provides a critical, empirical reassessment of the prevailing 'multi-agent is better' narrative. The proposed hybrid routing mechanism offers a practical solution to the cost/accuracy tradeoff.

⚙️ Technical Details

Problem Definition

Setting: Execution of a user request r through a directed graph G=(V,E) where nodes are LLM agents or tools

Inputs: User request r

Outputs: Final output G(r) and cumulative cost C(r)

Pipeline Flow

Input Request
Router/Cascade Controller (decides SAS vs MAS)
Execution (SAS or MAS Graph)
Output
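
A rough sketch of this flow in routing form is given below. Since the summary does not specify how the controller is implemented, the difficulty estimator here is a deliberately naive placeholder (request length), and `route`, `length_difficulty`, `toy_sas`, and `toy_mas` are hypothetical names.

```python
from typing import Callable, Tuple

Result = Tuple[str, float]  # (output, cost)

def route(request: str,
          difficulty: Callable[[str], float],
          run_sas: Callable[[str], Result],
          run_mas: Callable[[str], Result],
          threshold: float = 0.5) -> Tuple[str, str, float]:
    """Send the request to SAS when predicted difficulty is low, else to MAS."""
    system = "MAS" if difficulty(request) >= threshold else "SAS"
    output, cost = (run_mas if system == "MAS" else run_sas)(request)
    return system, output, cost

# Placeholder difficulty proxy (an assumption): longer requests count as harder.
def length_difficulty(request: str) -> float:
    return min(len(request.split()) / 20.0, 1.0)

# Toy systems with made-up costs; real ones would call an LLM.
def toy_sas(r): return f"sas:{r}", 1.0
def toy_mas(r): return f"mas:{r}", 8.0

easy = route("add two numbers", length_difficulty, toy_sas, toy_mas)
hard = route(
    "plan a seven day trip across three cities with a fixed budget and hotel constraints",
    length_difficulty, toy_sas, toy_mas)
```

In practice the `difficulty` callable is where a learned classifier or a confidence signal would plug in; the surrounding control flow stays the same.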

System Modules

Router / Cascade Controller

Decides whether to route the request to a Single-Agent or Multi-Agent System

Model or implementation: Not explicitly specified (likely a classifier or heuristic based on confidence)

Confidence-Guided Tracer

Identifies critical agents that bottleneck performance by analyzing confidence and output quality

Model or implementation: Analytical algorithm on execution graph
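
One way to picture the tracer is the sketch below. It is an assumed heuristic, not the paper's algorithm: for each failed run it blames the agent with the lowest self-reported confidence, then treats the most frequently blamed agent as the critical one. The trace format and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# One trace: per-agent confidence in [0, 1], plus whether the final output passed.
Trace = Tuple[Dict[str, float], bool]

def blame(trace: Trace) -> Optional[str]:
    """For a failed run, blame the least confident agent (assumed heuristic)."""
    confidences, passed = trace
    if passed:
        return None
    return min(confidences, key=confidences.get)

def critical_agent(traces: List[Trace]) -> Optional[str]:
    """Return the agent blamed most often across failed runs, if any."""
    counts = Counter(b for b in map(blame, traces) if b is not None)
    return counts.most_common(1)[0][0] if counts else None

# Made-up traces for a three-agent coding workflow.
traces = [
    ({"analyst": 0.9, "coder": 0.4, "tester": 0.8}, False),
    ({"analyst": 0.8, "coder": 0.9, "tester": 0.9}, True),
    ({"analyst": 0.7, "coder": 0.3, "tester": 0.6}, False),
    ({"analyst": 0.2, "coder": 0.6, "tester": 0.7}, False),
]
critical = critical_agent(traces)  # "coder" is blamed in two of three failures
```

A fuller version would also weigh output quality and graph position, since a low-confidence upstream agent can cause failures that surface several nodes later.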

Novel Architectural Elements

Hybrid routing/cascading architecture connecting SAS and MAS workflows
Graph-based error attribution mechanism using confidence scores to pinpoint critical nodes

Modeling

Base Model: Gemini-2.0-Flash (default for experiments), also evaluated on GPT-4o, Gemini-2.0-Pro, Llama-4

Training Method: Inference-only evaluation and architectural orchestration

Adaptation: None (Prompt engineering only)

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. MetaGPT/ChatDev: Proposed hybrid approach dynamically switches between SAS and MAS rather than using MAS for all queries
vs. Prior MAS studies: Re-evaluates performance with frontier models (Gemini 2.0/GPT-4o), finding diminished returns for MAS compared to older baselines (e.g., GPT-3.5)
vs. AutoGen [not cited in paper]: Focuses on when to use MAS vs SAS, rather than just facilitating the creation of MAS

Limitations

Dependency on accurate confidence estimation for the routing/tracing mechanism
Evaluation primarily focuses on specific tasks (code, math, travel), potentially limiting generalization to other domains
The definition of 'critical agent' assumes bottlenecks are node-centric rather than purely interaction-centric

Reproducibility

The paper clearly defines the graph formulation and the MAS defect categories. Implementation specifics of the routing classifier (training data, features) are covered only briefly in the main text.

📊 Experiments & Results

Evaluation Setup

Comparison of SAS and MAS across diverse agentic tasks using 9 frameworks

Benchmarks:

HumanEval (Code Generation)
MBPP (Code Generation)
MATH (Mathematical Reasoning)
GSM8K (Mathematical Reasoning)
TravelPlanner (Planning)
AIME (Advanced Math, Olympiad)

Metrics:

Accuracy (Pass@1)
Token Cost (Input/Output)
Win Rate (MAS vs SAS)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Re-evaluation of reported MAS gains using modern frontier models (Gemini-2.0-Flash) shows a diminishing gap compared to original papers (often using GPT-3.5).
Average across tasks	Accuracy Improvement (MAS - SAS)	10	3	-7
Cost analysis reveals massive overhead for MAS compared to SAS.
Average across 7 datasets	Input Token Multiplier (MAS / SAS, ×), low end of range	1	4	3
Average across 7 datasets	Input Token Multiplier (MAS / SAS, ×), high end of range	1	220	219
Performance of the proposed hybrid architecture.
Various agentic applications	Accuracy Improvement	Not reported in the paper	Not reported in the paper	+1.1% to +12%
Various agentic applications	Cost Reduction (%, MAS alone = 100), best case	100	11.9	-88.1

Experiment Figures

Breakdown of correctness comparison between MAS and SAS (Both Pass, Both Fail, MAS Win, SAS Win).
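
The four-way breakdown in this figure can be computed from per-instance pass/fail results, as in the sketch below. The function name and the toy outcomes are illustrative assumptions and do not reproduce the paper's data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def breakdown(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """results holds (sas_passed, mas_passed) per instance; returns category shares."""
    labels = {
        (True, True): "Both Pass",
        (False, False): "Both Fail",
        (False, True): "MAS Win",
        (True, False): "SAS Win",
    }
    counts = Counter(labels[(bool(s), bool(m))] for s, m in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in labels.values()}

# Made-up outcomes for ten instances, for illustration only.
toy = ([(True, True)] * 5 + [(False, False)] * 2
       + [(False, True)] * 2 + [(True, False)])
shares = breakdown(toy)
tie_rate = shares["Both Pass"] + shares["Both Fail"]  # share where the system choice did not matter
```

The tie rate is the quantity that motivates hybrid designs: every tied instance is one where the cheaper system would have produced the same verdict.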

Main Takeaways

The advantage of MAS over SAS diminishes as base LLM capabilities (context, reasoning) improve, except on extremely hard tasks like AIME.
MAS frequently 'overthinks' simple tasks, leading to lower performance than SAS in specific subsets of data.
Most instances (approx. 80%) result in ties (Both Pass or Both Fail), suggesting that for many inputs, the complexity of MAS is unnecessary.
Three primary MAS defects are identified: Node-Level (critical agent bottleneck), Edge-Level (overwhelming downstream agents), and Path-Level (error propagation).

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting strategies
Multi-Agent Systems (MAS) architectures
Graph theory basics (nodes, edges, dependency graphs)

Key Terms

MAS: Multi-Agent Systems, in which complex tasks are decomposed and delegated to specialized LLM agents with defined roles

SAS: Single-Agent Systems, in which a single LLM agent handles the entire task, possibly with tool use

Agent Cascade: A hybrid paradigm where a request is first processed by a cheaper/simpler system (SAS) and only escalated to a complex system (MAS) if necessary

Agent Routing: A mechanism to dynamically direct requests to either SAS or MAS based on predicted difficulty or agent confidence

Dependency Graph: A graph abstraction of agent execution where nodes are agents/tools and edges represent communication/dependencies

Critical Agent: The specific agent in a workflow whose performance bottlenecks the overall system's success

Overthinking: A failure mode where agents generate excessive reasoning or corner cases that confuse downstream agents
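
To make the Agent Cascade term concrete, here is a minimal sketch. It assumes each system reports a confidence score alongside its output and that escalation is triggered by a fixed threshold; the threshold, the toy systems, and all names are hypothetical rather than taken from the paper.

```python
from typing import Callable, Tuple

# Each system returns (output, confidence in [0, 1], cost). All hypothetical.
Run = Callable[[str], Tuple[str, float, float]]

def cascade(request: str, run_sas: Run, run_mas: Run,
            min_confidence: float = 0.7) -> Tuple[str, float, bool]:
    """Try SAS first; escalate to MAS only when SAS confidence is too low."""
    output, confidence, cost = run_sas(request)
    if confidence >= min_confidence:
        return output, cost, False
    mas_output, _, mas_cost = run_mas(request)
    # An escalated request pays for both the SAS attempt and the MAS run.
    return mas_output, cost + mas_cost, True

# Toy systems: SAS is confident only on short requests; costs are made up.
def toy_sas(r): return f"sas:{r}", (0.9 if len(r) < 20 else 0.4), 1.0
def toy_mas(r): return f"mas:{r}", 0.8, 10.0

requests = ["2 + 2", "reverse a string", "plan a multi-city trip under budget"]
runs = [cascade(r, toy_sas, toy_mas) for r in requests]
cascade_cost = sum(cost for _, cost, _ in runs)
mas_only_cost = sum(toy_mas(r)[2] for r in requests)
```

Because an escalated request costs more than running MAS alone, a cascade only saves money when escalation is rare, which is why it depends on a reliable confidence signal.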