AFlow: Automating Agentic Workflow Generation

📝 Paper Summary

Automated Agentic Optimization Workflow Generation

AFLOW automates the creation of agentic workflows by framing workflow optimization as a search over code-represented workflows, using Monte Carlo Tree Search to iteratively refine structure and prompts for higher performance at lower cost.

Core Problem

Manually designing agentic workflows (sequences of LLM invocations) requires significant human effort and limits scalability, while existing automated methods struggle with limited search spaces or inefficient exploration.

Why it matters:

Human-designed workflows are hard to scale to new domains and lack transferability
Existing automated methods (like ADAS) use linear heuristic search that fails to discover effective workflows efficiently
Optimizing workflows allows smaller, cheaper models to outperform larger, expensive models, democratizing advanced AI capabilities

Concrete Example: In GSM8K (math), a standard Chain-of-Thought workflow might fail on complex reasoning. Manually adding 'Review' or 'Ensemble' nodes is tedious. AFLOW automatically discovers a workflow that generates 5 solutions, ensembles them, and verifies the result using a Python programmer node, improving success rates without human design.

Key Novelty

MCTS-driven Search over Code-Represented Workflows

Reformulates workflow optimization as a search over code, where nodes are LLM calls and edges are logic (loops, conditionals)
Uses Monte Carlo Tree Search (MCTS) to navigate this effectively unbounded space, treating each code modification as a search step and storing successful patterns in the tree
Introduces 'Operators' (predefined code blocks like Ensemble or Review) to accelerate search, while allowing the LLM optimizer to write custom logic
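
To make the "workflow as code" idea concrete, here is a minimal sketch rather than the paper's actual template. It assumes a hypothetical `LLM` callable (prompt in, text out) and a caller-supplied `verify` function standing in for the Python programmer node mentioned in the GSM8K example. The names `generate`, `ensemble`, and `workflow` are illustrative.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], Awaitable[str]]


async def generate(llm: LLM, question: str, prompt: str) -> str:
    """Node: a single LLM invocation with its own prompt."""
    return await llm(f"{prompt}\n{question}")


async def ensemble(llm: LLM, question: str, prompt: str, n: int = 5) -> str:
    """Operator: run the generate node n times and majority-vote the answers."""
    answers = await asyncio.gather(
        *(generate(llm, question, prompt) for _ in range(n))
    )
    return Counter(answers).most_common(1)[0][0]


async def workflow(llm: LLM, question: str, verify: Callable[[str], bool]) -> str:
    """Edges are ordinary control flow: a loop plus a conditional."""
    answer = ""
    for _attempt in range(3):  # loop edge: retry up to three times
        answer = await ensemble(llm, question, "Solve step by step.")
        if verify(answer):  # conditional edge: stop once the check passes
            break
    return answer
```

Because the edges are plain Python, an optimizer that edits this file can add a retry loop or a branch as easily as it can rewrite a prompt string, which a static graph representation would not allow.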

Architecture

The overall AFLOW framework and iterative search cycle

Evaluation Highlights

Outperforms state-of-the-art baselines by 5.7% on average across 6 benchmarks (including Math, Code, and QA)
Surpasses ADAS (previous SOTA automated method) by 19.5% on average, with a 57% improvement on MATH lv5 and MBPP
Enables GPT-4o-mini (via AFLOW) to outperform GPT-4o on HumanEval (94.7% vs 93.9%) at only 4.55% of the inference cost

Breakthrough Assessment

9/10

A significant step forward in automating agent design. The ability of smaller models to beat larger ones via discovered workflows is a major efficiency gain. Moving to a code-based MCTS search addresses the expressivity limits of graph-based methods.

⚙️ Technical Details

Problem Definition

Setting: Search for optimal workflow W* in space S = {(N, E)} to maximize evaluation function G(W, T) for task T

Inputs: Task dataset T (e.g., math problems), Evaluation function G (e.g., solve rate), Operator set O

Outputs: Optimized executable workflow code W*
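
The setting above can be restated as a single objective. This only rewrites the definitions already given, using the same symbols (W, S, N, E, G, T), and adds no new assumptions:

```latex
W^{*} = \operatorname*{arg\,max}_{W \in S} \; G(W, T),
\qquad S = \{(N, E)\}
```

Here N is the set of nodes (LLM invocations) and E the set of edges (the code connecting them), so each candidate W is one concrete pairing of nodes and connecting logic.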

Pipeline Flow

Initialization: Start with blank template or base workflow
Selection: Choose a parent workflow from the tree using Soft Mixed Probability
Expansion: LLM Optimizer modifies the parent's code (adding Operators or changing Prompts) to create a child workflow
Evaluation: Execute child workflow on validation set (5 runs) to get score
Backpropagation: Update tree with score and modification experience
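
The five steps above can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: `select`, `optimize`, and `evaluate` are hypothetical callables supplied by the caller, the tree is kept as a flat list of nodes with parent links, and "backpropagation" is simplified to recording the outcome on the parent.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class Node:
    code: str                          # workflow source code
    score: float = 0.0                 # mean validation score
    parent: Optional["Node"] = None
    experience: List[str] = field(default_factory=list)  # notes on tried edits


def search(
    base_code: str,
    select: Callable[[List[Node]], Node],
    optimize: Callable[[Node], str],
    evaluate: Callable[[str], float],
    iterations: int = 20,
    runs: int = 5,
) -> Node:
    # Initialization: score the starting workflow.
    root = Node(code=base_code)
    root.score = mean(evaluate(root.code) for _ in range(runs))
    tree = [root]
    for _ in range(iterations):
        parent = select(tree)                  # Selection
        child_code = optimize(parent)          # Expansion
        score = mean(evaluate(child_code) for _ in range(runs))  # Evaluation
        child = Node(code=child_code, score=score, parent=parent)
        # Simplified backpropagation: log the outcome on the parent only.
        outcome = "improved" if score > parent.score else "did not improve"
        parent.experience.append(f"{child_code!r} {outcome} ({score:.2f})")
        tree.append(child)
    return max(tree, key=lambda n: n.score)
```

The defaults of 20 iterations and 5 runs mirror the numbers quoted in this summary. Everything else, including how experience is worded, is a placeholder.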

System Modules

Selector (Search Mechanism)

Selects which workflow to improve next

Model or implementation: N/A (Algorithmic)
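
One way to picture a selection rule that mixes uniform sampling with score-based weighting is sketched below. The exact formula and constants are not given in this summary, so the softmax form, the mixing weight `lam`, and the sharpness `alpha` are all assumptions chosen for illustration.

```python
import math
import random
from typing import List, Sequence


def mixed_probabilities(
    scores: Sequence[float], lam: float = 0.2, alpha: float = 0.4
) -> List[float]:
    """Blend a uniform distribution with a softmax over scores.

    lam and alpha are illustrative values, not taken from the summary.
    """
    n = len(scores)
    s_max = max(scores)
    # Subtract the max score before exponentiating for numerical stability.
    weights = [math.exp(alpha * (s - s_max)) for s in scores]
    total = sum(weights)
    return [lam / n + (1 - lam) * w / total for w in weights]


def select_index(scores: Sequence[float], rng=random, **kwargs) -> int:
    """Sample the index of the workflow to expand next."""
    probs = mixed_probabilities(scores, **kwargs)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

The uniform term guarantees every candidate keeps a probability of at least `lam / n`, so a low-scoring workflow can still be revisited. The score-weighted term pulls most of the sampling mass toward the current best candidates.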

Optimizer (Search Mechanism)

Generates new workflow code based on parent and experience

Model or implementation: Claude-3.5-sonnet

Evaluator (Search Mechanism)

Tests workflow performance

Model or implementation: DeepSeek-V2.5 or GPT-4o-mini (Executor)

Novel Architectural Elements

Code-based Edge Representation within MCTS: Unlike graph-based methods such as GPTSwarm, AFLOW searches over executable Python code, enabling loops and conditionals
Operator-based Search Space: Augments raw code generation with high-level semantic blocks (Ensemble, Programmer) to prune the search space
Tree-Structured Experience: Explicitly stores modification history and failure/success logs in the tree nodes to guide the LLM optimizer
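
As a rough illustration of tree-structured experience, the sketch below stores each attempted modification on the node it was applied to and renders the log as text that could be placed in the optimizer's prompt. The class names, fields, and output format are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Modification:
    description: str      # what the optimizer changed
    score_before: float   # parent's score
    score_after: float    # child's score

    @property
    def succeeded(self) -> bool:
        return self.score_after > self.score_before


@dataclass
class TreeNode:
    workflow_code: str
    score: float
    children_log: List[Modification] = field(default_factory=list)

    def experience_prompt(self) -> str:
        """Render past successes and failures as plain text."""
        lines = []
        for m in self.children_log:
            tag = "success" if m.succeeded else "failure"
            lines.append(
                f"[{tag}] {m.description}: "
                f"{m.score_before:.1f} -> {m.score_after:.1f}"
            )
        return "\n".join(lines) or "No modifications tried yet."
```

Keeping failures alongside successes lets the optimizer avoid proposing an edit that has already been tried from the same parent.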

Modeling

Base Model: Claude-3.5-sonnet (Optimizer), GPT-4o-mini/DeepSeek-V2.5 (Executor)

Training Method: Inference-time search/optimization (No weight updates)

Compute: Search runs for 20 iterations. Each iteration involves 5 executions on validation set (20% of dataset). Optimizer calls Claude-3.5-sonnet.
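
A back-of-the-envelope count of executor calls during search follows from these numbers. The dataset size below is a made-up example, and the estimate assumes one workflow execution per validation problem per run, ignoring optimizer calls and the initial evaluation.

```python
def validation_executions(
    dataset_size: int,
    iterations: int = 20,
    runs_per_iteration: int = 5,
    validation_fraction: float = 0.20,
) -> int:
    """Estimate workflow executions spent on validation during search."""
    val_size = round(dataset_size * validation_fraction)
    return iterations * runs_per_iteration * val_size


# Hypothetical dataset of 1,000 problems: 200 validation problems,
# each executed 5 times in each of 20 iterations.
print(validation_executions(1000))
```

Each of those executions may itself involve several LLM calls when the workflow contains ensembles or loops, which is why the search phase dominates the cost.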

Comparison to Prior Work

vs. ADAS: AFLOW uses MCTS (tree) instead of linear search, preserving exploration history better and avoiding local optima
vs. GPTSwarm: AFLOW uses code representation for edges (allowing conditionals/loops) whereas GPTSwarm is limited to graph structures
vs. DSPy: AFLOW optimizes the entire workflow structure (topology), not just prompts/signatures within a fixed pipeline
vs. TextGrad [not cited in paper]: TextGrad optimizes via backprop-like textual feedback on fixed graphs; AFLOW modifies the graph structure itself via code generation

Limitations

High computational cost during search phase due to multiple validations per iteration
Currently focused on reasoning tasks with numerical evaluation; harder to apply to open-ended tasks without clear metrics
Dependency on the capability of the Optimizer LLM (Claude-3.5-sonnet used) to write valid Python code

Reproducibility

Code: https://github.com/FoundationAgents/AFlow

The code is publicly available. The code template and the optimizer prompt are provided in the paper's Appendix. Datasets are standard benchmarks.

📊 Experiments & Results

Evaluation Setup

Search performed on validation set (20%), final evaluation on held-out test set (80%).

Benchmarks:

HumanEval (Code Generation)
MBPP (Code Generation)
MATH (Mathematical Reasoning)
GSM8K (Mathematical Reasoning)
HotpotQA (Multi-hop Question Answering)
DROP (Reading Comprehension/Reasoning)

Metrics:

Solve Rate (%)
Pass@1
F1 Score

Statistical methodology: Each workflow is tested 3 times on the test set and the average is reported. During search, each candidate is run 5 times on the validation set.

Key Results

AFLOW consistently outperforms both manual baselines and the ADAS automated method across all benchmarks using GPT-4o-mini as executor.

Benchmark	Metric	Baseline	This Paper	Δ
HumanEval	Pass@1	82.4	94.7	+12.3
MBPP	Pass@1	53.4	83.4	+30.0
MATH	Solve Rate	35.4	56.2	+20.8
HotpotQA	F1	69.2	73.5	+4.3
GSM8K	Solve Rate	92.7	93.5	+0.8

AFLOW enables smaller models to rival or beat larger models (GPT-4o) on specific tasks.

Benchmark	Metric	Baseline (GPT-4o)	This Paper	Δ
HumanEval	Pass@1	93.9	94.7	+0.8

Experiment Figures

Bar chart comparing AFLOW against 7 baselines across 6 benchmarks.

Scatter plot showing Cost ($) vs Performance (Pass@1) on HumanEval.

Tree-structured iteration process on GSM8K.

Main Takeaways

AFLOW achieves a 5.7% average improvement over state-of-the-art baselines across six datasets.
Workflows generated by AFLOW are transferable: workflows optimized with DeepSeek-V2.5 perform well when executed by GPT-4o-mini, though model-specific optimization is best.
Cost-efficiency: AFLOW enables GPT-4o-mini to surpass GPT-4o on HumanEval while incurring only 4.55% of the inference cost.
Ablation shows that Operators accelerate search, but AFLOW can still discover effective structures (like ensembles) from scratch without them, which suggests the MCTS approach is robust.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting strategies
Monte Carlo Tree Search (MCTS)
Agentic workflows (Chain-of-Thought, ReAct, etc.)
Python programming concepts (async functions, classes)

Key Terms

MCTS: Monte Carlo Tree Search—a heuristic search algorithm that balances exploration (trying new paths) and exploitation (refining promising paths) using tree structures

Agentic Workflow: A structured sequence of LLM invocations (nodes) connected by logic (edges) to solve complex tasks

Node: A fundamental unit in a workflow representing a single LLM action (e.g., generate, review) with parameters like prompt and temperature

Edge: The logic connecting nodes, represented here as Python code (loops, conditionals) rather than static graph links

Operators: Predefined, reusable combinations of nodes (e.g., Ensemble, Review & Revise) acting as high-level building blocks for the search

ADAS: Automated Design of Agentic Systems—a prior method using linear heuristic search over code-based agents

Soft Mixed Probability Selection: A selection strategy in MCTS that combines uniform random selection with score-based weighting to prevent local optima

Pareto Front: The set of optimal trade-offs between two conflicting objectives (here, performance vs. cost), where no objective can be improved without sacrificing the other
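
To illustrate the Pareto front definition, the sketch below filters a list of (name, cost, performance) points down to the non-dominated ones. The data points are invented for the example and are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, cost, performance)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep points that no other point beats on both cost and performance."""
    front = []
    for name, cost, perf in points:
        dominated = any(
            # Another point is at least as good on both axes...
            (c <= cost and p >= perf)
            # ...and strictly better on at least one.
            and (c < cost or p > perf)
            for _, c, p in points
        )
        if not dominated:
            front.append((name, cost, perf))
    return front


# Hypothetical workflows: lower cost and higher performance are better.
candidates = [("A", 1.0, 80.0), ("B", 2.0, 90.0), ("C", 2.5, 85.0), ("D", 0.5, 70.0)]
print([name for name, _, _ in pareto_front(candidates)])
```

In this toy data, C costs more than B while scoring lower, so it is dominated and dropped. A, B, and D each represent a different cost/performance trade-off and remain on the front.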