SEW: Self-Evolving Agentic Workflows for Automated Code Generation

📝 Paper Summary

Multi-Agent Systems, Automated Code Generation

SEW is a framework that automatically generates and optimizes multi-agent code generation workflows by evolving both the workflow topology and individual agent prompts using LLM-based mutation operators.

Core Problem

Current multi-agent systems for code generation rely on hand-crafted workflows and prompts, which are costly to design and do not adapt to the complexity of individual coding tasks.

Why it matters:

Manual workflow design is labor-intensive and requires domain expertise, limiting scalability
Static workflows cannot leverage the full potential of LLMs to autonomously adapt strategies for complex problems
A workflow optimized for one domain (e.g., machine learning) often fails in another (e.g., software development), necessitating automated adaptation

Concrete Example: When asked to implement `sum_squares(n)`, a single agent might write a loop `range(1, n)` that misses the last integer. SEW evolves a workflow where a 'Code Reviewing Agent' detects this off-by-one error and instructs a 'Code Rewriting Agent' to correct the range to `range(1, n+1)`.
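The off-by-one error in this example can be reproduced in a few lines. This is only an illustrative sketch: it assumes `sum_squares(n)` is meant to return 1² + 2² + ... + n², which the summary implies but does not state, and the name `sum_squares_buggy` is made up here for contrast.

```python
def sum_squares_buggy(n: int) -> int:
    # range(1, n) stops at n - 1, so n * n is never added
    return sum(i * i for i in range(1, n))


def sum_squares(n: int) -> int:
    # range(1, n + 1) includes n, which is the fix described above
    return sum(i * i for i in range(1, n + 1))


# The kind of check a reviewing step could run to expose the bug.
print(sum_squares_buggy(3))  # 1 + 4 = 5
print(sum_squares(3))        # 1 + 4 + 9 = 14
```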

Key Novelty

Self-Evolving Workflow (SEW) with Dual Evolution

Jointly optimizes the team structure (workflow topology) and the specific instructions (prompts) for each agent using LLMs as mutation operators
Introduces Direct Evolution (modifying prompts directly) and Hyper Evolution (modifying the mutation prompts themselves) to escape local optima
Identifies CoRE (Code Representation and Execution) as the optimal textual representation for LLMs to generate and modify executable agentic workflows

Architecture

The overall SEW framework pipeline, showing the progression from Workflow Generation to Workflow Evolution, and finally Agent Evolution.

Evaluation Highlights

Achieves 50.9% pass@1 on LiveCodeBench using GPT-4o mini, outperforming the backbone model (38.0%) and PromptBreeder (45.9%)
The agent evolution module improves the Task Parsing Workflow's LiveCodeBench performance by 20.3% (relative) over workflow evolution alone
Demonstrates that the CoRE representation yields the highest Generation Successful Rate (GSR, 72.7%), ahead of BPMN (47.3%) and Python (29.1%)

Breakthrough Assessment

8/10

Significant step in automating agent design. The rigorous comparison of workflow representations (CoRE vs BPMN) and the dual-layer evolution (workflow + agent) provide a robust framework for self-improving systems.

⚙️ Technical Details

Problem Definition

Setting: Automatic code generation where an input problem description is transformed into executable code via a multi-agent system

Inputs: Natural language task description D

Outputs: Executable Python code solution

Pipeline Flow

Workflow Generation (Initial Template)
Workflow Evolution (Topology Optimization)
Agent Evolution (Prompt Optimization via DE/HE)

System Modules

Workflow Generator

Generate initial default workflows based on the task description and a template workflow

Model or implementation: LLM (e.g., GPT-4o mini)

Workflow Evolver (Evolution)

Reconstruct and optimize the topology of the workflow (adding/removing/reordering agents)

Model or implementation: LLM (e.g., GPT-4o mini)

Agent Evolver (Evolution)

Optimize the specific prompt for each agent within the evolved workflow using DE or HE operators

Model or implementation: LLM (e.g., GPT-4o mini)
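The difference between the two agent-evolution operators can be sketched as plain function composition. This is a minimal sketch under stated assumptions, not the paper's implementation: the `LLM` callable type, the prompt layout, the hyper-mutation wording, and `stub_llm` are all hypothetical; only the example mutation prompt comes from the Key Terms section.

```python
from typing import Callable, List

# Hypothetical interface: an LLM is any function from prompt text to completion text.
LLM = Callable[[str], str]


def direct_evolution(llm: LLM, agent_prompt: str, mutation_prompt: str) -> str:
    """DE: apply the mutation prompt to the agent prompt in one LLM call."""
    return llm(f"{mutation_prompt}\n\nINSTRUCTION:\n{agent_prompt}")


def hyper_evolution(llm: LLM, agent_prompt: str, mutation_prompt: str,
                    hyper_mutation_prompt: str) -> str:
    """HE: first rewrite the mutation prompt, then apply the rewritten one as in DE."""
    new_mutation_prompt = llm(f"{hyper_mutation_prompt}\n\nINSTRUCTION:\n{mutation_prompt}")
    return direct_evolution(llm, agent_prompt, new_mutation_prompt)


# Deterministic stand-in for a real model so the sketch runs offline.
calls: List[str] = []


def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"[rewrite {len(calls)}]"


mutation = "Modify this instruction to be more creative"
de_result = direct_evolution(stub_llm, "Write Python code for the task.", mutation)
he_result = hyper_evolution(stub_llm, "Write Python code for the task.", mutation,
                            "Improve this mutation prompt")  # hypothetical wording
print(de_result, he_result, len(calls))
```

In this sketch DE costs one LLM call and HE costs two, because HE spends an extra call rewriting the mutation prompt before using it.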

Novel Architectural Elements

Dual-loop evolutionary architecture: Outer loop optimizes workflow topology, inner loop optimizes individual agent prompts
Utilization of LLMs as mutation operators for both code-based workflow structures and natural language prompts
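The dual-loop idea can be illustrated with a toy hill-climbing loop. This sketch makes several assumptions that the summary does not confirm: workflows are reduced to ordered lists of agent names, candidates are accepted greedily, and the `toy_*` functions stand in for LLM-based mutation and benchmark scoring. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Workflow = List[str]       # toy stand-in for a workflow: ordered agent names
Prompts = Dict[str, str]   # agent name -> prompt text


def evolve(workflow: Workflow, prompts: Prompts,
           mutate_workflow: Callable[[Workflow], Workflow],
           mutate_prompt: Callable[[str], str],
           score: Callable[[Workflow, Prompts], float],
           outer_steps: int = 3, inner_steps: int = 2) -> Tuple[Workflow, Prompts, float]:
    best_wf, best_prompts = list(workflow), dict(prompts)
    best_score = score(best_wf, best_prompts)
    for _ in range(outer_steps):                      # outer loop: topology
        cand_wf = mutate_workflow(best_wf)
        cand_prompts = {a: best_prompts.get(a, f"You are the {a}.") for a in cand_wf}
        for _ in range(inner_steps):                  # inner loop: per-agent prompts
            for agent in cand_wf:
                trial = dict(cand_prompts)
                trial[agent] = mutate_prompt(cand_prompts[agent])
                if score(cand_wf, trial) > score(cand_wf, cand_prompts):
                    cand_prompts = trial
        cand_score = score(cand_wf, cand_prompts)
        if cand_score > best_score:                   # greedy acceptance (assumption)
            best_wf, best_prompts, best_score = cand_wf, cand_prompts, cand_score
    return best_wf, best_prompts, best_score


def toy_mutate_workflow(wf: Workflow) -> Workflow:
    for extra in ("reviewer", "rewriter"):
        if extra not in wf:
            return wf + [extra]
    return list(wf)


def toy_mutate_prompt(prompt: str) -> str:
    return prompt if "edge cases" in prompt else prompt + " Check edge cases."


def toy_score(wf: Workflow, prompts: Prompts) -> float:
    # Rewards distinct agents and prompts that mention edge cases.
    return len(set(wf)) + sum("edge cases" in prompts[a] for a in wf)


best_wf, best_prompts, best_score = evolve(
    ["coder"], {"coder": "Write Python code."},
    toy_mutate_workflow, toy_mutate_prompt, toy_score)
print(best_wf, best_score)
```

In a real system the score would come from running the workflow on benchmark problems, and both mutation functions would be LLM calls.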

Modeling

Base Model: GPT-4o mini and Gemini-1.5-pro-002 (used as backbone for all agents and evolutionary operators)

Comparison to Prior Work

vs. PromptBreeder: SEW evolves the workflow topology (structure) in addition to the prompts, whereas PromptBreeder focuses on prompt evolution
vs. AFlow: SEW explores specific representation schemes (CoRE) for better LLM interpretation and uses Hyper Evolution for prompts, rather than MCTS
vs. ADAS: SEW provides a granular analysis of workflow representation schemes (CoRE vs BPMN) to maximize generation success
vs. EvoAgent [not cited in paper]: EvoAgent expands expert agents into multi-agent systems; SEW constructs workflows from scratch based solely on task descriptions

Limitations

Generalization to non-coding tasks (e.g., reasoning, planning) remains untested
Some generated workflows are logically valid but fail to produce executable outputs (execution constraints)
Performance is heavily dependent on the capabilities of the underlying backbone LLM

Reproducibility

Code: https://github.com/EvoAgentX/EvoAgentX

The code is publicly available. The paper's Appendix provides examples of prompts (mutation, hyper-mutation) and workflow representations (CoRE, Python, YAML). The benchmarks use their standard task descriptions.

📊 Experiments & Results

Evaluation Setup

Code generation from natural language descriptions

Benchmarks:

LiveCodeBench (LCB), Code Generation subset (challenging code generation problems)
HumanEval (Python coding problems)
MBPP (Basic Python problems)

Metrics:

pass@1
pass@5
pass@10
LSR (Logical Successful Rate)
GSR (Generation Successful Rate)
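Following the definitions under Key Terms, LSR and GSR are both proportions over a set of generated workflows. A minimal sketch of that computation is below; the record type and its field names are hypothetical, and the four sample records are made-up illustration data, not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GenerationRecord:
    structurally_valid: bool        # counts toward LSR
    produced_executable_code: bool  # counts toward GSR


def success_rates(records: List[GenerationRecord]) -> Tuple[float, float]:
    """Return (LSR, GSR) as fractions of all generated workflows."""
    if not records:
        raise ValueError("need at least one generated workflow")
    n = len(records)
    lsr = sum(r.structurally_valid for r in records) / n
    gsr = sum(r.produced_executable_code for r in records) / n
    return lsr, gsr


# Illustrative data only.
sample = [
    GenerationRecord(True, True),
    GenerationRecord(True, False),   # valid structure, but no runnable code
    GenerationRecord(False, False),
    GenerationRecord(True, True),
]
lsr, gsr = success_rates(sample)
print(f"LSR={lsr:.2f} GSR={gsr:.2f}")
```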

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on standard benchmarks, showing SEW outperforming baselines:
LiveCodeBench	pass@1	38.0 (GPT-4o mini backbone)	50.9	+12.9
LiveCodeBench	pass@1	45.9 (PromptBreeder)	50.9	+5.0
HumanEval	pass@1	91.6	92.1	+0.5
Analysis of workflow representation schemes to determine the best format for LLM generation:
Workflow Representation Analysis	GSR	29.1% (Python)	72.7% (CoRE)	+43.6 pp
Impact of combining workflow evolution with agent evolution:
LiveCodeBench	pass@1	42.3	50.9	+8.6
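The Δ column, and the 20.3% figure quoted in the Evaluation Highlights, can be checked with simple arithmetic. The sketch below assumes that the 20.3% is a relative gain over the 42.3 baseline in the last row, which the numbers support but the summary does not state explicitly; the row labels are shorthand added here.

```python
# (label, baseline, this paper) taken from the Key Results table
rows = [
    ("LiveCodeBench pass@1, first row", 38.0, 50.9),
    ("LiveCodeBench pass@1, second row", 45.9, 50.9),
    ("HumanEval pass@1", 91.6, 92.1),
    ("Representation analysis GSR", 29.1, 72.7),
    ("LiveCodeBench pass@1, agent evolution row", 42.3, 50.9),
]

# Absolute differences, in points
deltas = [round(ours - baseline, 1) for _, baseline, ours in rows]
for (label, _, _), delta in zip(rows, deltas):
    print(f"{label}: {delta:+.1f}")

# Relative gain for the last row: (50.9 - 42.3) / 42.3
relative_gain = (50.9 - 42.3) / 42.3 * 100
print(f"relative gain: {relative_gain:.1f}%")
```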

Experiment Figures

Box plots comparing the performance distribution (pass@1, @5, @10) of different evolution strategies (DE vs HE) on LiveCodeBench.

Main Takeaways

CoRE (Code Representation and Execution) is the most effective representation for agentic workflows, balancing logical correctness and execution success better than BPMN or Python.
Workflow evolution creates novel topologies, but Agent evolution is crucial for unlocking full performance (injecting high-quality prompts into the structure).
Hyper Evolution (HE) gives more consistent performance across tasks (lower variance), while Direct Evolution (DE) achieves higher peak performance.
SEW consistently outperforms single-agent baselines and state-of-the-art workflow optimization methods like AFlow and ADAS across multiple datasets.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Prompt Engineering
Familiarity with Multi-Agent Systems and Agentic Workflows
Basic knowledge of Evolutionary Algorithms (mutation, population)

Key Terms

SEW: Self-Evolving Workflow—the proposed framework for automatically generating and optimizing multi-agent systems

CoRE: Code Representation and Execution—a unified framework integrating natural language, pseudo-code, and flow-based programming for defining workflows

BPMN: Business Process Model and Notation—a graphical standard for modeling business processes and workflows

GSR: Generation Successful Rate—the probability that a generated workflow produces executable Python code

LSR: Logical Successful Rate—the probability that a generated workflow is structurally valid according to the representation scheme

DE: Direct Evolution—an operator where an LLM directly modifies an agent's prompt using a mutation prompt

HE: Hyper Evolution—an operator where an LLM first modifies the mutation prompt itself, then uses the new mutation prompt to modify the agent

mutation prompt: A meta-prompt used to instruct an LLM on how to modify or improve another prompt (e.g., 'Modify this instruction to be more creative')

pass@k: A metric measuring the probability that at least one of k generated code samples for a problem passes all of its test cases
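The summary does not say how pass@k is estimated. A common choice, introduced alongside HumanEval, is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below implements that estimator; whether SEW uses it is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(10, 5, 1))  # half the samples are correct, so pass@1 is 0.5
```

The benchmark-level score is then the mean of this value over all problems.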