Automated Design of Agentic Systems

📝 Paper Summary

Self-evolving, Agentic reasoning, Automated Design of Agentic Systems (ADAS)

Meta Agent Search uses a meta-agent to iteratively write and refine Python code for new agentic systems, automatically discovering novel designs that outperform hand-crafted workflows like Chain-of-Thought.

Core Problem

Developing powerful agentic systems currently relies on manual, domain-specific tuning of building blocks (like prompts, tool use, and workflows), which is labor-intensive and likely suboptimal compared to learned solutions.

Why it matters:

The 'Bitter Lesson' suggests manually designed artifacts are eventually replaced by learned ones as compute scales, yet agent design remains largely manual.
The space of possible agentic workflows is vast; humans are unlikely to discover complex, non-intuitive combinations of prompting, loops, and tool usage that maximize performance.

Concrete Example: In the ARC logic puzzle challenge, standard Chain-of-Thought fails because it lacks verification. Meta Agent Search automatically discovered a 'Structured Feedback and Ensemble Agent' that generates multiple solutions, uses specific 'expert' personas (simplicity, efficiency) to critique them, and ensembles the best answers—a complex workflow a human might not hand-code.

Key Novelty

Meta Agent Search in Code Space

Defines the search space for agents as 'any valid Python code', allowing the discovery of arbitrary control flows, loops, and tool usages rather than just optimizing text prompts.
Employs a 'Meta Agent' (an LLM) that iteratively programs new agents based on an archive of past high-performing designs, effectively 'learning to invent' better agents.

Architecture

The Meta Agent Search algorithm workflow.

Evaluation Highlights

+13.6 F1 points on the DROP reading comprehension benchmark over the best hand-designed baseline (65.8 to 79.4).
+14.4 percentage points of accuracy on the MGSM math benchmark over the best hand-designed baseline (39.0% to 53.4%).
+25.9 percentage points of accuracy on GSM8K (43.6% to 69.5%) when transferring an agent discovered on a different math benchmark (MGSM), outperforming the baselines.

Breakthrough Assessment

9/10

Establishes a new paradigm (ADAS) by demonstrating that agents defined in code can be automatically discovered by LLMs. The performance gains over hand-crafted baselines are substantial and the transferability is surprising.

⚙️ Technical Details

Problem Definition

Setting: Search for an agent program A (defined in code) within a search space S that maximizes an evaluation function f(A) (e.g., accuracy) on a task T.
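
Written as a single optimization problem, the setting above reads as follows. The symbol $A^{*}$ for the best agent found is notation introduced here for illustration; $S$ and $f$ are as defined above, with $f$ measured on task $T$.

```latex
A^{*} = \arg\max_{A \in S} f(A)
```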

Inputs: A task description and a set of validation examples.

Outputs: Python code defining a 'forward' function for a new agentic system.
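
As a rough illustration of what such an output might look like, here is a minimal sketch of a 'forward' function implementing a two-step Chain-of-Thought agent. The signature, the `llm` callable, and the `fake_llm` stand-in are all assumptions made for this example; the summary does not specify the framework's actual API.

```python
from typing import Callable


def forward(task: str, llm: Callable[[str], str]) -> str:
    """Hypothetical agent: reason step by step, then extract a final answer."""
    # Step 1: ask the model for intermediate reasoning (Chain-of-Thought).
    reasoning = llm(f"Think step by step about this task:\n{task}")
    # Step 2: ask for a final answer conditioned on that reasoning.
    answer = llm(
        f"Task:\n{task}\nReasoning:\n{reasoning}\nGive only the final answer."
    )
    return answer.strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline (assumption)."""
    if "Give only the final answer" in prompt:
        return " 4 "
    return "2 + 2 means adding two and two."


result = forward("What is 2 + 2?", fake_llm)
print(result)
```

Because the agent is ordinary Python, the meta agent is free to add loops, branches, or extra model calls inside `forward`, which is what distinguishes this search space from prompt-only optimization.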

Pipeline Flow

Archive Retrieval: Meta Agent samples past agents from the archive.
Generation: Meta Agent writes code for a new agent based on insights from sampled agents.
Evaluation: New agent code is executed on validation tasks.
Reflection/Refine: If execution fails, Meta Agent debugs the code.
Archive Update: Successful agents and their metrics are added to the archive.
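
The five steps above can be sketched as a short loop. This is only a schematic under stated assumptions: `propose`, `evaluate`, and `debug` are hypothetical callables standing in for the meta agent and the execution sandbox, the whole archive is passed to `propose` rather than a sample, and the retry limit is an invented parameter.

```python
from typing import Callable, Dict, List

Agent = Dict[str, object]  # {"code": str, "score": float}


def meta_agent_search(
    propose: Callable[[List[Agent]], str],
    evaluate: Callable[[str], float],
    debug: Callable[[str, Exception], str],
    n_iterations: int = 5,
    max_debug_attempts: int = 2,
) -> List[Agent]:
    archive: List[Agent] = []
    for _ in range(n_iterations):
        # Archive retrieval + generation: the meta agent sees past designs.
        code = propose(archive)
        for attempt in range(max_debug_attempts + 1):
            try:
                score = evaluate(code)  # Evaluation on validation tasks.
            except Exception as err:
                if attempt == max_debug_attempts:
                    break  # Give up on this candidate.
                code = debug(code, err)  # Reflection/refine after a failure.
            else:
                # Archive update: keep the working agent and its metric.
                archive.append({"code": code, "score": score})
                break
    return archive


# Toy stand-ins (assumptions) so the loop can be run without a model.
def propose(archive: List[Agent]) -> str:
    return f"BROKEN agent_{len(archive)}"  # every first draft fails once


def debug(code: str, err: Exception) -> str:
    return code.replace("BROKEN ", "")


def evaluate(code: str) -> float:
    if code.startswith("BROKEN"):
        raise RuntimeError("agent crashed")
    return 0.5 + 0.1 * int(code.split("_")[1])


archive = meta_agent_search(propose, evaluate, debug, n_iterations=3)
best = max(archive, key=lambda a: a["score"])
print(best["code"])
```

In this toy run each candidate fails once, is repaired by `debug`, and is then archived, so later proposals are conditioned on a growing archive.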

System Modules

Meta Agent

Programs new agents by generating Python code and high-level descriptions.

Model or implementation: GPT-4

Agent Framework

Provides the API and execution sandbox for the generated agents.

Model or implementation: Python Interpreter

📊 Experiments & Results

Evaluation Setup

Search conducted on validation sets; final evaluation on held-out test sets. Search runs for 25-30 iterations.

Benchmarks:

ARC Challenge (Visual logic/abstraction puzzles)
DROP (Reading Comprehension)
MGSM (Multilingual Math)
MMLU (Multi-task Problem Solving)
GPQA (Graduate-Level Science)

Metrics:

Accuracy
F1 Score
Statistical methodology: 95% bootstrap confidence interval reported.
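
For readers unfamiliar with the interval mentioned above, the following is a minimal sketch of a percentile bootstrap over per-example scores. The summary does not say which bootstrap variant or how many resamples were used, so the percentile method, the `n_resamples` default, and the toy data are assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    scores: List[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi


# Toy data: 70 correct and 30 incorrect answers (accuracy 0.70).
scores = [1.0] * 70 + [0.0] * 30
lo, hi = bootstrap_ci(scores)
print(f"accuracy 0.70, 95% CI approx [{lo:.2f}, {hi:.2f}]")
```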

Key Results

Main comparison results across multiple domains showing ADAS consistently outperforming hand-designed baselines.
Benchmark	Metric	Baseline	This Paper	Δ
DROP (Reading Comprehension)	F1 Score	65.8	79.4	+13.6
MGSM (Math)	Accuracy	39.0	53.4	+14.4
MMLU (Multi-task)	Accuracy	67.6	69.6	+2.0

Transferability experiments showing agents discovered on one math domain (MGSM) perform well on held-out math and non-math domains.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	43.6	69.5	+25.9
GSM-Hard	Accuracy	18.0	31.2	+13.2

Experiment Figures

Performance trajectory on the ARC challenge over iterations.

Main Takeaways

Meta Agent Search discovers agents that significantly outperform state-of-the-art hand-designed baselines (like CoT-SC, Self-Refine) across math and reading tasks.
The discovered agents exhibit strong transferability: an agent found for MGSM transfers effectively to GSM8K (+25.9 accuracy points over the baseline) and even to non-math domains like DROP.
Search in code space allows the emergence of complex behaviors like 'expert committees' and 'dynamic memory' that are hard to engineer manually.
Improvements are larger in reasoning-heavy tasks (Math, Reading) than knowledge-heavy tasks (MMLU, Science), likely because better workflows can mitigate reasoning errors but cannot fabricate missing knowledge.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Foundation Models (LLMs) and prompting strategies (CoT, etc.)
Understanding of evolutionary or open-ended search algorithms
Basic Python programming concepts (control flow, functions)

Key Terms

ADAS: Automated Design of Agentic Systems—a research area aiming to automatically invent novel building blocks and designs for agentic systems.

Meta Agent: An LLM agent tasked with programming, evaluating, and refining other agents (Worker Agents).

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer.

Self-Refine: An agentic pattern where the model generates an initial output, critiques it, and generates a refined output.
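
The generate, critique, refine pattern just described can be sketched in a few lines. The prompts, the `max_rounds` limit, the "looks correct" stopping rule, and the `fake_llm` stand-in are assumptions for illustration, not the published Self-Refine implementation.

```python
from typing import Callable


def self_refine(task: str, llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Hypothetical Self-Refine loop: draft, critique, revise."""
    draft = llm(f"Solve:\n{task}")
    for _ in range(max_rounds):
        critique = llm(f"Critique this answer to '{task}':\n{draft}")
        # Assumed stopping rule: stop once the critic accepts the draft.
        if "looks correct" in critique.lower():
            break
        draft = llm(
            f"Task: {task}\nDraft: {draft}\nCritique: {critique}\nRevise the draft."
        )
    return draft


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a model: wrong first draft, fixed after critique."""
    if prompt.startswith("Solve"):
        return "5"
    if prompt.startswith("Critique"):
        return "Looks correct." if prompt.endswith("4") else "The sum is wrong."
    return "4"


refined = self_refine("What is 2 + 2?", fake_llm)
print(refined)
```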

Foundation Models (FMs): Large-scale pre-trained models (like GPT-4) used as the core reasoning engine within agentic systems.

Turing Complete: A system (like a programming language) capable of simulating any Turing machine, meaning it can theoretically represent any computable algorithm.

Quality-Diversity: Search algorithms that aim to find a set of solutions that are both high-performing and diverse in their behavior.

ARC: Abstraction and Reasoning Corpus—a benchmark measuring general intelligence through few-shot visual grid transformation tasks.