AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

📝 Paper Summary

Multi-agent automated machine learning (AutoML)

AutoML-Agent automates the entire machine learning pipeline—from data retrieval to model deployment—using specialized LLM agents coordinated via retrieval-augmented planning and multi-stage verification to ensure runnable, high-quality code.

Core Problem

Existing AutoML systems require high technical expertise to configure, while current LLM-based approaches usually handle only isolated pipeline steps (e.g., just HPO or feature engineering) or rely on slow, expensive training-based search.

Why it matters:

Current fragmentation leads to suboptimal solutions because decisions in data processing affect model design and vice versa
High configuration barriers prevent domain experts without coding skills from building effective ML solutions
Training-based search methods are computationally prohibitive for practical, rapid development

Concrete Example: A user asks for a spam detection model. A standard LLM might generate code with hallucinated dependencies or miss the inference latency constraint. A single-step tool might optimize the model architecture but fail to preprocess the text data correctly for that specific architecture, causing runtime errors.

Key Novelty

Retrieval-Augmented Planning with Role-Specific Decomposition

Instead of a single plan, the system retrieves external knowledge (like arXiv papers) to generate multiple diverse plans, then decomposes them into sub-tasks for specialized agents (Data Agent, Model Agent)
Uses 'prompting-based execution' where agents simulate execution and return expected results/metrics without running code, allowing fast exploration before the final code is written
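
The retrieve-then-plan idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a toy in-memory knowledge base and word-overlap scoring in place of real arXiv search, and `retrieve`, `build_plan_prompts`, and the prompt wording are hypothetical. It builds one planning prompt per retrieved snippet, so each candidate plan is grounded in different knowledge.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace (toy tokenizer, assumption)."""
    return set(text.lower().split())


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k snippets that share the most words with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_plan_prompts(instruction: str, snippets: list[str]) -> list[str]:
    """Build one planning prompt per snippet to encourage diverse plans."""
    return [
        f"Instruction: {instruction}\n"
        f"Retrieved knowledge: {snippet}\n"
        f"Propose candidate plan #{i} for the full ML pipeline."
        for i, snippet in enumerate(snippets, start=1)
    ]


# Hypothetical knowledge base standing in for retrieved papers and docs.
CORPUS = [
    "distilled transformers reduce inference latency for text classification",
    "graph neural networks for node classification",
    "tf-idf with logistic regression is a strong spam detection baseline",
]

instruction = "build a spam detection model for text with low inference latency"
snippets = retrieve(instruction, CORPUS, k=2)
prompts = build_plan_prompts(instruction, snippets)
```

Each prompt would then be sent to an LLM to produce one plan; that call is omitted here because it depends on the provider.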

Architecture

The complete workflow of AutoML-Agent, detailing the interaction between the User, Agent Manager, and specialized agents (Prompt, Data, Model, Operation).

Evaluation Highlights

Achieves 87.1% success rate in generating runnable, compliant pipelines under constraint-aware settings, significantly outperforming GPT-4 (50.7%) and DS-Agent (37.5%)
Reduces search time by ~8x compared to tree-search-based methods (SELA) while maintaining comparable or superior model performance
Outperforms human expert baselines and AutoGluon on normalized performance scores across 7 downstream tasks including image, text, and tabular data

Breakthrough Assessment

8/10

Significantly advances LLM-based AutoML by successfully integrating the full pipeline (data to deployment) with a practical, training-free search mechanism. The high success rate on complex constraints is a strong differentiator.

⚙️ Technical Details

Problem Definition

Setting: Automated generation of end-to-end machine learning pipeline code from natural language user instructions

Inputs: Natural language user instruction I (containing task description, constraints, and requirements)

Outputs: Deployment-ready model M* and corresponding inference endpoint code

Pipeline Flow

Initialization: Agent Manager (Amgr) receives and verifies user request
Planning: Prompt Agent (Ap) parses request; Amgr generates multiple plans via RAP
Execution: Data Agent (Ad) and Model Agent (Am) decompose and simulate plans (Prompting-Based Execution)
Verification: Amgr verifies simulated results; best plan sent to Operation Agent (Ao)
Implementation: Ao generates code; Amgr verifies code execution (ImpVer) and deploys
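
The five stages above can be sketched as a single control flow. Every function here is a hypothetical stand-in for an LLM-backed agent, and the simulated scores are hard-coded, so this only shows the order of operations under those assumptions rather than the framework's actual code.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    simulated_score: float = 0.0  # filled in by prompting-based execution


def verify_request(instruction: str) -> bool:
    """Request verification stand-in: accept any non-empty instruction."""
    return bool(instruction.strip())


def parse_request(instruction: str) -> dict:
    """Prompt Agent stand-in: a real system would call an LLM here."""
    return {"Problem": instruction}


def generate_plans(request: dict, n: int = 3) -> list[Plan]:
    """Agent Manager stand-in for RAP: emit n candidate plans."""
    return [Plan(name=f"plan-{i}") for i in range(n)]


def simulate(plan: Plan) -> Plan:
    """Data/Model Agent stand-in: fake an estimated metric without training."""
    plan.simulated_score = 0.5 + 0.1 * int(plan.name.split("-")[1])
    return plan


def generate_code(plan: Plan) -> str:
    """Operation Agent stand-in: return placeholder source code."""
    return f"print('running {plan.name}')"


def run_pipeline(instruction: str) -> str:
    if not verify_request(instruction):                     # Initialization
        raise ValueError("request rejected")
    request = parse_request(instruction)                    # Planning
    plans = [simulate(p) for p in generate_plans(request)]  # Execution
    best = max(plans, key=lambda p: p.simulated_score)      # Verification
    code = generate_code(best)                              # Implementation
    compile(code, "<generated>", "exec")  # syntax check only, not a real run
    return code


generated = run_pipeline("classify spam emails")
```

In this toy setup the highest simulated score wins; the real Agent Manager also checks results against the user's requirements before selecting a plan.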

System Modules

Agent Manager (Amgr)

Orchestrates the entire process, verifies requests/results, selects best plans, and manages agent coordination

Model or implementation: GPT-4 (gpt-4o-2024-05-13)

Prompt Agent (Ap)

Parses user instructions into a standardized JSON object with keys like User, Problem, Dataset, Model

Model or implementation: Mixtral-8x7B-Instruct-v0.1 (fine-tuned)
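
To make the parsed output concrete, here is a small sketch of what such a JSON object and a basic key check might look like. Only the four top-level keys come from the summary above; the nested fields, their values, and the `validate_parsed_request` helper are illustrative assumptions.

```python
import json

# Top-level keys named in the summary; the actual schema may contain more.
REQUIRED_KEYS = {"User", "Problem", "Dataset", "Model"}


def validate_parsed_request(raw: str) -> dict:
    """Parse the Prompt Agent's output and check the expected top-level keys."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Hypothetical parse of "build a spam detector with low latency".
example_output = json.dumps({
    "User": {"expertise": "beginner"},             # nested fields are assumptions
    "Problem": {"task": "text classification"},
    "Dataset": {"name": "user-provided spam corpus"},
    "Model": {"constraint": "low inference latency"},
})

parsed = validate_parsed_request(example_output)
```

A fixed schema like this gives downstream agents a predictable structure to read, which is one plausible reason to fine-tune a dedicated parser.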

Data Agent (Ad) (Execution)

Handles data retrieval, preprocessing, augmentation, and analysis tasks

Model or implementation: GPT-4 (gpt-4o-2024-05-13)

Model Agent (Am) (Execution)

Handles model selection, HPO, profiling, and candidate ranking

Model or implementation: GPT-4 (gpt-4o-2024-05-13)

Operation Agent (Ao)

Generates actual executable code for the selected best plan

Model or implementation: GPT-4 (gpt-4o-2024-05-13)

Novel Architectural Elements

Retrieval-Augmented Planning (RAP): Generates multiple distinct plans based on retrieved external knowledge rather than a single path
Prompting-Based Plan Execution: Agents simulate plan execution and return estimated metrics/results to rank plans without expensive code execution cycles
Three-Stage Verification Loop: Explicit verification steps at Request (ReqVer), Execution (ExecVer), and Implementation (ImpVer) phases with feedback loops
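
One way to read the verification feedback loop is as a generic "produce, check, retry with feedback" routine that each stage could reuse. The sketch below applies it to the implementation stage, using Python's built-in `compile` as a stand-in checker; `run_with_verification`, `syntax_check`, and `toy_operation_agent` are hypothetical names, and the real system verifies by executing code, not only by parsing it.

```python
from typing import Callable, Optional


def run_with_verification(
    produce: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> str:
    """Call produce, check with verify, and pass any error back as feedback."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        output = produce(feedback)
        feedback = verify(output)  # None means the check passed
        if feedback is None:
            return output
    raise RuntimeError(f"verification failed after {max_attempts} attempts: {feedback}")


def syntax_check(code: str) -> Optional[str]:
    """Stand-in verifier: report a syntax error message, or None if it parses."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def toy_operation_agent(feedback: Optional[str]) -> str:
    """Stand-in generator: broken on the first try, fixed once it gets feedback."""
    return "result = 1 +" if feedback is None else "result = 1 + 1"


final_code = run_with_verification(toy_operation_agent, syntax_check)
```

Capping the number of attempts keeps a persistently failing stage from looping forever, at the price of sometimes giving up on a fixable plan.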

Modeling

Base Model: GPT-4 (gpt-4o-2024-05-13) for most agents; Mixtral-8x7B-Instruct-v0.1 for Prompt Agent

Training Method: Supervised Fine-Tuning (SFT) using LoRA

Adaptation: LoRA (Low-Rank Adaptation) applied to Prompt Agent only
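
As background on why LoRA is considered parameter-efficient: it freezes a weight matrix of shape d x k and learns two small factors of shapes d x r and r x k, so the trainable parameter count drops from d*k to r*(d + k). The arithmetic below uses illustrative dimensions and rank, which are assumptions and not values reported for the Prompt Agent.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Parameters updated when the whole d x k matrix is trained."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Illustrative sizes only (assumption): one 4096 x 4096 projection, rank 8.
d, k, r = 4096, 4096, 8
full = full_finetune_params(d, k)
lora = lora_trainable_params(d, k, r)
fraction = lora / full
print(f"LoRA trains {lora:,} of {full:,} parameters ({fraction:.2%})")
```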

Training Data:

2.3K instruction-response pairs generated via EvolInstruct
Seed instructions manually created, then expanded automatically

Compute: Single run takes ~525 seconds and costs ~$0.30 USD (using GPT-4o)

Comparison to Prior Work

vs. DS-Agent: AutoML-Agent supports full pipeline including deployment and uses retrieval-augmented planning instead of case banks; achieves higher success rate without relying on pre-existing code cases
vs. AutoGluon: AutoML-Agent offers a natural language interface and broader task coverage (e.g., graph, time-series) beyond tabular data, with customizable constraints
vs. SELA: AutoML-Agent uses prompting-based execution (simulated) rather than training-based search (Monte Carlo Tree Search), resulting in ~8x faster search times [SELA is cited in paper]
vs. HuggingGPT: AutoML-Agent performs verification and specifically handles code generation for training/deployment, whereas HuggingGPT focuses on task planning and model inference API calls [cited in paper]

Limitations

Heavy reliance on closed-source LLMs (GPT-4) creates cost and privacy concerns
Pseudo-execution (prompting-based) may hallucinate feasibility of certain complex pipelines before actual code generation
Verification steps add latency compared to direct code generation approaches
Performance depends on the quality of retrieved external knowledge

Reproducibility

Code: https://github.com/deepauto-ai/automl-agent

The code is publicly available and includes the agent prompts and framework structure. The main logic relies on the closed-source GPT-4 API, so replicating the results incurs API costs.

📊 Experiments & Results

Evaluation Setup

Generate ML pipelines for 7 downstream tasks (Image/Text/Tabular/Graph/Time-Series) under two conditions: Constraint-Free and Constraint-Aware

Benchmarks:

Image Classification (Computer Vision)
Text Classification (NLP)
Tabular Classification/Regression (Classic ML)
Time-Series Forecasting (Time Series Analysis)
Node Classification (Graph Mining)

Metrics:

Success Rate (SR): 0-1 score based on compliance with requirements
Normalized Performance Score (NPS): Performance relative to baselines
Comprehensive Score (CS): Weighted average of SR and NPS
Statistical methodology: Average scores from five independent runs reported
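
The relationship between the three metrics can be sketched in a few lines. Two details here are assumptions, since this summary does not state them: the loss-to-score mapping 1 / (1 + loss) and the equal 0.5/0.5 weighting in the comprehensive score. The function names are hypothetical as well.

```python
def normalized_performance_score(value: float, lower_is_better: bool) -> float:
    """Map a metric onto a 0-1 scale.

    Assumption: losses use 1 / (1 + loss); metrics that are already
    in [0, 1] and higher-is-better (e.g. accuracy) pass through unchanged.
    """
    if lower_is_better:
        if value < 0:
            raise ValueError("loss must be non-negative")
        return 1.0 / (1.0 + value)
    return value


def comprehensive_score(sr: float, nps: float, w_sr: float = 0.5) -> float:
    """Weighted average of success rate and NPS (equal weights assumed)."""
    return w_sr * sr + (1.0 - w_sr) * nps


# An RMSLE of 0.25 becomes an NPS of about 0.8 under the assumed mapping.
nps_from_rmsle = normalized_performance_score(0.25, lower_is_better=True)

# Combining the reported constraint-aware averages (SR 0.871, NPS 0.963).
cs = comprehensive_score(sr=0.871, nps=0.963)
```

With a different weighting the combined score would shift toward whichever component is weighted more heavily, so the equal split should be checked against the paper before reuse.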

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
AutoML-Agent demonstrates superior reliability in generating compliant code compared to baselines, especially when constraints are involved.
Constraint-Aware Setting (Average across 7 tasks)	Success Rate (SR)	0.507 (GPT-4)	0.871	+0.364
Constraint-Aware Setting (Average across 7 tasks)	Success Rate (SR)	0.375 (DS-Agent)	0.871	+0.496
In terms of model performance (NPS), AutoML-Agent produces better models than both human experts and other AutoML tools.
Constraint-Aware Setting (Average across 7 tasks)	Normalized Performance Score (NPS)	0.864	0.963	+0.099
Constraint-Aware Setting (Average across 7 tasks)	Normalized Performance Score (NPS)	0.899	0.963	+0.064
Ablation studies confirm the necessity of the Retrieval-Augmented Planning (RAP) and Verification components.
Average CS Metric	Comprehensive Score (CS)	0.58	0.91	+0.33
Search Time (Average)	Seconds	3421	426	-2995
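
As a quick consistency check, the Δ column and the roughly 8x search-time claim can be recomputed from the table values. The row labels below are shorthand introduced for this sketch; the numbers are copied from the rows above.

```python
# (label, baseline, this paper) copied from the Key Results rows.
rows = [
    ("SR, baseline 0.507", 0.507, 0.871),
    ("SR, baseline 0.375", 0.375, 0.871),
    ("NPS, baseline 0.864", 0.864, 0.963),
    ("NPS, baseline 0.899", 0.899, 0.963),
    ("CS, ablation baseline", 0.58, 0.91),
]

# Round to three decimals to match the precision used in the table.
deltas = [round(ours - base, 3) for _, base, ours in rows]

# Search time in seconds: baseline vs. this paper.
baseline_seconds, ours_seconds = 3421, 426
speedup = baseline_seconds / ours_seconds
seconds_saved = baseline_seconds - ours_seconds

for (label, _, _), delta in zip(rows, deltas):
    print(f"{label}: {delta:+.3f}")
print(f"search-time speedup: {speedup:.2f}x ({seconds_saved} s saved)")
```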

Experiment Figures

Breakdown of time and monetary cost per run.

Main Takeaways

Multi-stage verification is critical: The jump in Success Rate from raw GPT-4 (50.7%) to AutoML-Agent (87.1%) highlights the value of the Request-Execution-Implementation verification loop.
Prompting-based execution works: Simulating execution via LLM reasoning allows for effective plan ranking without the massive computational cost of actual training (as seen in the comparison with SELA).
Consistent performance across modalities: Unlike DS-Agent which excels in tabular data but struggles elsewhere, AutoML-Agent maintains high scores across Vision, NLP, and Graph tasks.
Retrieval augmentation improves constraint handling: The system successfully identifies how to meet specific latency or accuracy constraints by retrieving relevant technical knowledge during planning.

📚 Prerequisite Knowledge

Prerequisites

Understanding of the standard ML pipeline (preprocessing, HPO, deployment)
Familiarity with LLM prompting strategies (RAG, chain-of-thought)
Basic knowledge of AutoML concepts

Key Terms

AutoML: Automated Machine Learning—automating the application of machine learning to real-world problems

HPO: Hyperparameter Optimization—the process of choosing a set of optimal hyperparameters for a learning algorithm

RAP: Retrieval-Augmented Planning—generating plans by retrieving relevant external knowledge (e.g., papers, docs) to ground the LLM's reasoning

Plan Decomposition: Breaking down a high-level plan into granular sub-tasks specific to an agent's role (e.g., data cleaning for Data Agent)

Prompting-Based Execution: Simulating the execution of a plan step via LLM inference to estimate results without actually running code, used to speed up search

CS: Comprehensive Score—a combined metric of success rate (SR) and normalized performance score (NPS)

NPS: Normalized Performance Score—a transformation of loss-based metrics into a 0-1 scale for uniform comparison

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique for LLMs

JSON: JavaScript Object Notation—a standard text-based format for representing structured data

RMSLE: Root Mean Squared Logarithmic Error—a metric used for regression tasks