Multi-agent Architecture Search via Agentic Supernet

📝 Paper Summary

Automated Multi-Agent System Design, Agentic Architecture Search, Resource-Efficient Agents

MaAS replaces static multi-agent workflows with an agentic supernet that dynamically samples query-specific architectures, balancing performance with token costs via differentiable and textual gradient optimization.

Core Problem

Existing automated agent design methods search for a single, complex, 'one-size-fits-all' workflow, which is inefficient for simple queries and fails to adapt to diverse domains within a single benchmark.

Why it matters:

Deploying complex multi-agent systems for simple tasks (e.g., elementary arithmetic) wastes significant computational resources and money (token costs)
Static architectures struggle with heterogeneous benchmarks (e.g., GAIA) where some tasks need web search while others need file reading, forcing practitioners to split datasets manually
Current SOTA methods like AFlow optimize for performance but ignore the prohibitive inference costs of massive agent teams

Concrete Example: For a simple arithmetic question like '2+2', current systems might trigger a complex multi-agent debate consuming thousands of tokens. Conversely, for a Ph.D.-level algebra problem, a simple chain-of-thought fails. A static system cannot optimally handle both.

Key Novelty

Agentic Supernet (MaAS)

Paradigm shift from searching for one optimal graph to optimizing a probability distribution over many possible agent architectures (the supernet)
Introduces a controller that inspects the query difficulty and samples a custom multi-agent topology (e.g., simple I/O for easy tasks, multi-turn debate for hard ones) per instance
Optimizes discrete agent components (prompts, tools) using textual gradients while optimizing architecture probabilities using differentiable sampling

Architecture

The overall framework of MaAS, showing how a query is processed by a controller to sample a subnetwork from the agentic supernet.

Evaluation Highlights

Achieves 51.82% accuracy on MATH benchmark with only $0.42 inference cost, compared to AFlow's 51.28% accuracy at $1.66 cost (approx. 4x cheaper)
Outperforms state-of-the-art automated methods by 0.54% to 16.89% across six benchmarks including HumanEval and GSM8K
Reduces training costs significantly: optimizes in 53 minutes for $3.38 on MATH, whereas comparable baseline AFlow requires 184 minutes and $22.50
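The ratios behind these highlights follow directly from the reported figures. A short check, using only the numbers quoted above (the dictionary and variable names are illustrative):

```python
# Figures quoted in the highlights above (MATH benchmark, MaAS vs. AFlow).
aflow = {"inference_usd": 1.66, "training_usd": 22.50, "training_min": 184}
maas = {"inference_usd": 0.42, "training_usd": 3.38, "training_min": 53}

# A ratio above 1 means MaAS is cheaper or faster by that factor.
ratios = {key: aflow[key] / maas[key] for key in aflow}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

This prints roughly 3.95x for inference cost, 6.66x for training cost, and 3.47x for training time, consistent with the "approx. 4x cheaper" inference figure.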

Breakthrough Assessment

8/10

Significantly advances automated agent design by solving the efficiency vs. performance trade-off. The concept of an 'agentic supernet' with dynamic routing is a strong conceptual leap over static graph search.

⚙️ Technical Details

Problem Definition

Setting: Multi-objective optimization of a conditional probability distribution over directed acyclic graphs (DAGs) of agentic operators

Inputs: Natural language query q

Outputs: Sampled multi-agent system G and final solution a

Pipeline Flow

Input Query → Controller Network (Embedding + FFN)
Layer-wise Sampling (Selects operators like CoT, Debate, or Early-Exit based on probabilities)
Execution (Run the sampled Multi-Agent System)
Feedback Loop (Update distribution via Policy Gradient, update operators via Textual Gradient)
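The first three stages of this flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: the operator names, the toy controller heuristic (short queries exit early more often), and the helper names `sample_architecture`, `run_pipeline`, `toy_controller`, and `toy_execute` are all assumptions, and the feedback loop is omitted.

```python
import random

# Hypothetical operator names; "EarlyExit" stops sampling further layers.
OPERATORS = ["IO", "CoT", "Debate", "EarlyExit"]


def sample_architecture(layer_probs, rng):
    """Sample one operator per layer, stopping at the first EarlyExit."""
    chosen = []
    for probs in layer_probs:
        op = rng.choices(OPERATORS, weights=probs, k=1)[0]
        if op == "EarlyExit":
            break
        chosen.append(op)
    return chosen


def run_pipeline(query, controller, execute, rng):
    """Controller -> layer-wise sampling -> execution of the sampled system."""
    layer_probs = controller(query)  # one probability vector per layer
    architecture = sample_architecture(layer_probs, rng)
    return architecture, execute(query, architecture)


def toy_controller(query, num_layers=4):
    """Stand-in for the embedding + FFN controller (assumption)."""
    exit_p = 0.7 if len(query) < 10 else 0.1  # short query treated as "easy"
    rest = (1.0 - exit_p) / 3
    first = [1 / 3, 1 / 3, 1 / 3, 0.0]  # the first layer never exits
    return [first] + [[rest, rest, rest, exit_p]] * (num_layers - 1)


def toy_execute(query, architecture):
    """Stand-in for running the LLM-backed operators (assumption)."""
    return f"answer to {query!r} via {' -> '.join(architecture)}"


rng = random.Random(0)
arch, answer = run_pipeline("2+2", toy_controller, toy_execute, rng)
print(arch, answer)
```

Because the exit probability depends on the query, easy inputs tend to yield shallow systems and hard inputs deeper ones, which is the behavior the pipeline is meant to produce.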

System Modules

Controller Network

Predict activation scores for available operators at each layer based on query embedding

Model or implementation: MiniLM / Sentence-BERT (for embedding) + FFN

Agentic Operator Execution

Execute specific agentic behaviors (Reasoning, Coding, Tool Use)

Model or implementation: gpt-4o-mini-0718 (or other backbones like Qwen-2.5-72b)

Supernet Optimizer

Update architecture probabilities and operator definitions

Model or implementation: Gradient-based optimizer + LLM-based Textual Gradient generator

Novel Architectural Elements

Agentic Supernet: A cascaded multi-layer workflow representing a continuous distribution of architectures rather than a single static graph
Query-Dependent Routing: Dynamic allocation of inference depth and width (operators) via a learned controller
Dual Optimization Loop: Simultaneously updating continuous architecture probabilities (numerical gradient) and discrete agent prompts/tools (textual gradient)
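The textual-gradient half of the dual loop can be pictured as a critique-then-rewrite step on an operator's prompt. The sketch below is an assumption about the general pattern, not the paper's code: `textual_gradient_step` and `fake_llm` are hypothetical names, and the stub replaces a real model call so the example runs offline.

```python
from typing import Callable, List, Tuple


def textual_gradient_step(
    prompt: str,
    failures: List[Tuple[str, str, str]],  # (query, model answer, expected answer)
    llm: Callable[[str], str],
) -> str:
    """Ask an LLM to critique a prompt on failed cases, then to rewrite it."""
    if not failures:
        return prompt  # nothing to learn from
    cases = "\n".join(
        f"- query: {q} | got: {got} | expected: {want}" for q, got, want in failures
    )
    critique = llm(
        "Explain why this prompt led to the errors below.\n"
        f"PROMPT:\n{prompt}\nERRORS:\n{cases}"
    )
    return llm(
        "Rewrite the prompt to address the critique. Return only the new prompt.\n"
        f"PROMPT:\n{prompt}\nCRITIQUE:\n{critique}"
    )


def fake_llm(message: str) -> str:
    """Canned responses standing in for a real model (assumption)."""
    if message.startswith("Explain"):
        return "The prompt does not ask for step-by-step work."
    return "Solve the problem step by step, then state the final answer."


new_prompt = textual_gradient_step(
    "Solve the problem.", [("12*13", "146", "156")], fake_llm
)
print(new_prompt)
```

The critique plays the role of a gradient in natural language: it says in which direction the discrete prompt should move, and the rewrite applies that update.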

Modeling

Base Model: gpt-4o-mini-0718 (primary), Qwen-2.5-72b-instruct, llama-3.1-70b

Training Method: Bilevel optimization: Policy Gradient for architecture search + Textual Gradient for prompt evolution

Objective Functions:

Purpose: Maximize solution accuracy while minimizing token cost.
Formally: max E[U(G; q, a) - λ · C(G; q)], where U is utility, C is cost, and λ is the cost penalty.

Purpose: Update architecture distribution parameters.
Formally: ∇_π L ≈ Σ_k m_k · ∇_π p(G_k), using cost-aware importance weights m_k.

Purpose: Update operator prompts/tools.
Formally: ∇_O = Textual_Gradient(Prompt, Temperature, Node_Structure)
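To make the first two objectives concrete, the sketch below runs a cost-aware policy-gradient update over a toy categorical distribution of three candidate architectures. Several parts are assumptions rather than the paper's method: the weight m_k is taken to be a baseline-subtracted reward divided by K, the update uses the standard score-function (REINFORCE) form with ∇ log p, and the utilities and token costs are made up.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, utility, cost, lam, k, lr, rng):
    """One cost-aware score-function update over a categorical distribution."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=k)
    rewards = [utility[g] - lam * cost[g] for g in samples]  # U - lambda * C
    baseline = sum(rewards) / k
    grad = [0.0] * len(logits)
    for g, r in zip(samples, rewards):
        m = (r - baseline) / k  # assumed form of the weight m_k
        for i in range(len(logits)):
            # d log p(g) / d logit_i for a softmax distribution
            grad[i] += m * ((1.0 if i == g else 0.0) - probs[i])
    return [x + lr * dx for x, dx in zip(logits, grad)]  # gradient ascent


# Toy setting: made-up utility and token cost per candidate architecture.
utility = [0.50, 0.52, 0.52]
cost = [100.0, 400.0, 4000.0]
logits = [0.0, 0.0, 0.0]
rng = random.Random(0)
for _ in range(2000):
    logits = policy_gradient_step(
        logits, utility, cost, lam=1e-3, k=4, lr=0.5, rng=rng
    )
print([round(p, 3) for p in softmax(logits)])
```

With λ = 1e-3 the cheapest architecture has the highest penalized reward (0.40 versus 0.12 and -3.48), so probability mass should shift toward it even though its raw utility is slightly lower. Raising or lowering λ changes where that trade-off lands.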

Key Hyperparameters:

layers_L: 4
cost_penalty_lambda: 1e-3, 5e-3, 1e-2
sampling_times_K: 4
threshold_thres: 0.3
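For experimentation, these values could be grouped in a small configuration object. The class name `MaASConfig` and the validation ranges are assumptions for illustration; only the field names and values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaASConfig:
    """Hypothetical container for the hyperparameters listed above."""

    layers_L: int = 4
    cost_penalty_lambda: float = 1e-3  # the list also gives 5e-3 and 1e-2
    sampling_times_K: int = 4
    threshold_thres: float = 0.3

    def __post_init__(self):
        # Assumed sanity ranges; the summary does not state valid bounds.
        if self.layers_L < 1 or self.sampling_times_K < 1:
            raise ValueError("layers_L and sampling_times_K must be positive")
        if self.cost_penalty_lambda < 0:
            raise ValueError("cost_penalty_lambda must be non-negative")
        if not 0.0 < self.threshold_thres <= 1.0:
            raise ValueError("threshold_thres must lie in (0, 1]")


# One config per candidate lambda value from the list.
configs = [MaASConfig(cost_penalty_lambda=lam) for lam in (1e-3, 5e-3, 1e-2)]
print(configs[0])
```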

Compute: Optimization on MATH took 53 minutes (wall-clock) with gpt-4o-mini. Training cost $3.38.

Comparison to Prior Work

vs. AFlow: MaAS generates a distribution of architectures (dynamic) rather than a single graph (static), resulting in roughly 4x lower inference cost on MATH ($0.42 vs. $1.66)
vs. GPTSwarm: MaAS optimizes both graph structure and node prompts simultaneously using textual gradients, whereas GPTSwarm focuses on communication
vs. AutoGPT [not cited in paper]: MaAS provides a structured search over valid architectures rather than open-ended recursive loops

Limitations

Relies on the quality of the 'Textual Gradient' provider (LLM capability to critique itself)
Performance gains diminish after 4 layers of depth
Cost-performance trade-off relies on sensitive tuning of the lambda parameter

Reproducibility

Code: https://github.com/bingreeky/MaAS

The code is publicly available, and dataset statistics are provided. Experiments use closed-source OpenAI APIs and open-source models served via API.

📊 Experiments & Results

Evaluation Setup

Evaluated on code generation, math reasoning, and general tool-use tasks.

Benchmarks:

GSM8K (Math Reasoning)
MATH (Hard Math Reasoning)
HumanEval (Code Generation)
MBPP (Code Generation)
GAIA (General Assistant / Tool Use)

Metrics:

Accuracy / Pass@1
Token Cost ($)
Wall-clock Time
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance comparisons on standard benchmarks showing MaAS superiority over static and automated baselines.
Benchmark	Metric	Baseline	This Paper	Δ
MATH	Accuracy (%)	51.28	51.82	+0.54
HumanEval	Pass@1 (%)	90.93	92.85	+1.92
GAIA (Level 1)	Accuracy (%)	10.75	25.91	+15.16

Cost efficiency analysis on the MATH benchmark.
Benchmark	Metric	Baseline	This Paper	Δ
MATH	Inference Total Cost ($)	1.66	0.42	-1.24
MATH	Training Total Cost ($)	22.50	3.38	-19.12

Ablation study validating component contributions.
Benchmark	Metric	Baseline	This Paper	Δ
HumanEval	Pass@1 (%)	90.17	92.85	+2.68

Experiment Figures

Cost analysis on MATH benchmark comparing Training Tokens, Inference API Cost, and Accuracy.

Visualization of operator sampling probabilities for queries of varying difficulty.

Main Takeaways

Dynamic resource allocation is effective: MaAS successfully learns to use 'Early Exit' for simple queries (e.g., 2+2) and complex agentic stacks for hard ones.
Cost-Performance Pareto Frontier: MaAS finds solutions that are both more accurate and cheaper than existing methods; on MATH it reaches 51.82% accuracy at $0.42 inference cost, versus AFlow's 51.28% at $1.66.
Cross-domain robustness: Unlike static systems that struggle on mixed benchmarks like GAIA, MaAS adapts its architecture per-instance to handle different task types (web search vs. file reading).

📚 Prerequisite Knowledge

Prerequisites

Neural Architecture Search (NAS) concepts (supernet, differentiable search)
LLM-based Agents (CoT, ReAct, Reflexion)
Reinforcement Learning (policy gradient estimation)

Key Terms

Agentic Supernet: A probabilistic, continuous distribution of agentic architectures that encompasses a vast number of possible multi-agent candidates

Textual Gradient: A method to approximate gradients for discrete text components (prompts, tools) by asking an LLM to analyze errors and suggest updates in natural language

Agentic Operator: A basic unit of the search space, representing a composite LLM process (e.g., Chain-of-Thought, Debate, ReAct) with specific prompts and tools

Early-exit Operator: A specific operator that allows the system to terminate the reasoning process at shallower layers for simple queries, saving tokens

MaAS: Multi-agent Architecture Search—the proposed framework that samples query-dependent systems from the supernet

Controller Network: A neural network that takes the query and current state to output probabilities for selecting the next agentic operator

MoE: Mixture-of-Experts—a neural network architecture where different parts (experts) are activated for different inputs; used here to implement the sampling process