AIDE: AI-Driven Exploration in the Space of Code

📝 Paper Summary

Self-evolving · Agentic reasoning · Multi-call tool use with flexible plan

AIDE automates machine learning engineering by modeling the trial-and-error process as a tree search over Python scripts, using LLMs to iteratively draft, debug, and improve code based on performance feedback.

Core Problem

Developing high-performance ML models requires labor-intensive trial and error (debugging, tuning, architectural changes). Standard agents struggle to manage this process because their conversation histories outgrow the context window and they lack structured exploration.

Why it matters:

Traditional AutoML is limited to predefined configuration spaces (hyperparameters) and cannot innovate on code structure or data processing like human engineers.
ReAct-style agents simply append history to the context, leading to context overflow and incoherent optimization attempts over long horizons.
Human engineers spend significant time on tedious iteration rather than conceptualizing high-level research hypotheses.

Concrete Example: In a Kaggle competition, a standard agent might generate a script, encounter a tensor shape mismatch, and fail to fix it because the error log is lost in a long conversation history. AIDE would treat the buggy script as a node in a tree, specifically target it with a 'Debug' operation using only relevant logs, and spawn a corrected child node.
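The 'Debug' step in this example can be sketched as a prompt built from the failing script and only the tail of its error log. This is a minimal illustration, not the paper's prompt: the function name `build_debug_prompt`, the `max_log_lines` parameter, and the prompt wording are all hypothetical.

```python
def build_debug_prompt(code: str, error_log: str, max_log_lines: int = 20) -> str:
    """Build a short debug prompt from one script and the end of its log.

    Hypothetical helper: only the last `max_log_lines` lines of the log are
    kept, so the prompt stays short regardless of how long the run was.
    """
    tail = "\n".join(error_log.strip().splitlines()[-max_log_lines:])
    return (
        "The script below failed. Fix the bug and return the full corrected script.\n\n"
        f"Script:\n{code}\n\n"
        f"Last lines of the error log:\n{tail}\n"
    )


# Illustrative log: 50 lines of noise followed by the actual error.
log = "\n".join(f"noise-{i:03d}" for i in range(50))
log += "\nRuntimeError: shape mismatch"
prompt = build_debug_prompt("import torch\n# ... model code ...", log)
```

Because the prompt depends only on the node being debugged, it does not grow with the number of earlier attempts.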

Key Novelty

Tree Search in the Space of Code (AIDE)

Frames ML engineering as a discrete optimization problem where every state is a standalone Python script and actions are code edits (Draft, Debug, Improve).
Replaces monolithic conversation history with a solution tree, using a 'Summarization Operator' to compress past attempts into concise hints for the next step, keeping prompts short and focused.
Decouples the search policy (which solution to work on next) from the coding capability (how to write the code), allowing for systematic exploration rather than linear greedy steps.
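The solution tree described above can be sketched as a small data structure. This is an illustration under assumptions rather than the released implementation: the `Node` class and all of its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; avoids recursive equality checks
class Node:
    """One standalone script in the solution tree (hypothetical layout)."""
    code: str                         # full text of the Python script
    metric: Optional[float] = None    # evaluator score; None if the run failed
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        """Attach `child` below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Number of edits between this script and the root draft."""
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d


root = Node(code="# initial draft")
fixed = root.add_child(Node(code="# debugged draft", metric=0.81))
```

Each node holds a complete script, so any node can be evaluated or edited without replaying the steps that produced it.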

Evaluation Highlights

Achieves 36.4% medal rate on MLE-Bench Lite with o1-preview, a nearly 5x improvement over the model alone (7.6%) and significantly outperforming OpenHands.
Surpasses 51.38% of human competitors on average across 16 tabular Kaggle competitions, outperforming H2O AutoML (35.34%) and AutoGPT (32.34%).
Outperforms 9 human experts on the 'Optimize a Kernel' task (Triton kernel optimization) in RE-Bench by discovering a solution faster than any human.

Breakthrough Assessment

9/10

AIDE demonstrates a step change in automated engineering, moving from parameter tuning to actual code evolution. Its dominance on MLE-Bench and its ability to beat humans on complex coding tasks (Triton kernels) mark a significant advance in agentic coding.

⚙️ Technical Details

Problem Definition

Setting: Optimization problem over a space of code solutions S to maximize a stateless objective function h(s) (e.g., validation accuracy).

Inputs: Dataset (tabular or other), task description, and evaluation metric.

Outputs: Optimal Python script s* that maximizes the evaluation metric.
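In symbols, the setting above can be written as a single optimization problem. The symbols S, h, and s* follow the text; writing the space of scripts as a calligraphic S is only a notational choice.

```latex
s^{*} = \arg\max_{s \in \mathcal{S}} h(s)
```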

Pipeline Flow

Search Policy (Selects a script node to expand)
Summarization Operator (Extracts metrics/hints from parent node)
Coding Operator (LLM generates new code via Draft/Debug/Improve)
Evaluator (Runs code, computes metric)
Tree Update (Adds new node to Solution Tree)
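The five steps above can be sketched as one loop. This is a toy illustration, not the released code: `run_search`, the dictionary layout of a tree entry, and the stand-in operators are all hypothetical, and the coding operator is replaced by a trivial string edit instead of an LLM call.

```python
from typing import Callable, Dict, List, Optional

Entry = Dict[str, object]  # hypothetical node record: code, score, parent index


def run_search(
    select: Callable[[List[Entry]], Optional[int]],
    summarize: Callable[[List[Entry], Optional[int]], str],
    generate: Callable[[str], str],
    evaluate: Callable[[str], Optional[float]],
    steps: int,
) -> List[Entry]:
    tree: List[Entry] = []
    for _ in range(steps):
        parent = select(tree)             # 1. search policy picks a node (or None)
        hints = summarize(tree, parent)   # 2. summarization operator
        code = generate(hints)            # 3. coding operator (an LLM in AIDE)
        score = evaluate(code)            # 4. evaluator
        tree.append({"code": code, "score": score, "parent": parent})  # 5. tree update
    return tree


# Toy stand-ins so the loop can run without an LLM.
def select_best(tree):
    scored = [i for i, n in enumerate(tree) if n["score"] is not None]
    return max(scored, key=lambda i: tree[i]["score"]) if scored else None


def summarize_parent(tree, parent):
    return "" if parent is None else tree[parent]["code"]


def toy_generate(hints):
    return hints + "x"  # stands in for an LLM edit of the parent script


def toy_evaluate(code):
    return float(len(code))  # stands in for running the script


tree = run_search(select_best, summarize_parent, toy_generate, toy_evaluate, steps=3)
```

Passing the four operators as plain callables mirrors the separation between the search policy and the coding capability: either can be swapped without touching the loop.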

System Modules

Search Policy (π)

Determines which solution to refine next based on heuristics (e.g., debug if broken, improve if valid and best-so-far)

Model or implementation: Hard-coded algorithm (Deterministic)
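A hard-coded policy of this kind might look like the sketch below. The rule order, the `min_drafts` threshold, and the function name `choose_action` are assumptions made for illustration; the summary only states that the policy debugs broken solutions and improves the best valid one.

```python
from typing import List, Optional, Tuple


def choose_action(
    scores: List[Optional[float]], min_drafts: int = 3
) -> Tuple[str, Optional[int]]:
    """Return (action, index of the solution to act on).

    `scores[i]` is the evaluator score of solution i, or None if it failed.
    """
    # Assumed rule 1: start with a few independent drafts.
    if len(scores) < min_drafts:
        return ("draft", None)
    # Assumed rule 2: repair the most recent broken script first.
    for i in reversed(range(len(scores))):
        if scores[i] is None:
            return ("debug", i)
    # Assumed rule 3: otherwise improve the best valid script so far.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return ("improve", best)
```

A fuller policy would also need to track which broken scripts have already been debugged, so that the same failure is not selected indefinitely.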

Summarization Operator (Σ)

Compresses the history of a solution branch into concise hints to prevent context overflow

Model or implementation: LLM-based extraction (implicit in design)
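To make the idea of compression concrete, here is a rule-based stand-in for the summarization step. It is not the LLM-based operator itself: `summarize_branch`, the `max_items` cap, and the attempt descriptions and scores are all invented for illustration.

```python
from typing import List, Optional, Tuple

# (short description of the attempt, score or None if the run failed)
Attempt = Tuple[str, Optional[float]]


def summarize_branch(attempts: List[Attempt], max_items: int = 3) -> str:
    """Keep only the most recent attempts, one line each."""
    lines = []
    for description, score in attempts[-max_items:]:
        outcome = "failed" if score is None else f"score={score:.3f}"
        lines.append(f"- {description}: {outcome}")
    return "\n".join(lines)


# Made-up history, for illustration only.
history = [
    ("baseline logistic regression", 0.712),
    ("added target encoding", None),
    ("gradient boosting, default params", 0.804),
    ("gradient boosting, tuned depth", 0.811),
]
hints = summarize_branch(history)
```

The output length is bounded by `max_items`, so the prompt built from it does not grow with the depth of the branch.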

Coding Operator (f)

Generates executable Python code based on the action type

Model or implementation: Large Language Model (e.g., GPT-4o, o1-preview)

Evaluator (h)

Executes the generated script and returns a scalar performance score

Model or implementation: Python Execution Environment
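An evaluator of this shape can be sketched with the standard library. The sketch assumes a convention that is not stated in the summary: the script prints its metric as the last line of standard output. The name `evaluate_script` and the timeout default are likewise hypothetical.

```python
import os
import subprocess
import sys
import tempfile
from typing import Optional


def evaluate_script(code: str, timeout_s: float = 60.0) -> Optional[float]:
    """Run `code` in a fresh interpreter and return its score, or None on failure."""
    # Write the script to a temporary file so it runs as a standalone program.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # treat a timeout as a failed run
    finally:
        os.unlink(path)
    if result.returncode != 0:
        return None  # the script crashed
    lines = result.stdout.strip().splitlines()
    try:
        return float(lines[-1])  # assumed convention: metric on the last line
    except (IndexError, ValueError):
        return None
```

Running each script in its own process keeps the objective stateless: the score depends only on the script text, not on anything left over from earlier runs.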

Novel Architectural Elements

Solution Tree data structure replacing linear conversation history
Explicit distinction between Drafting, Debugging, and Improving actions in the coding operator
Summarization Operator to enable 'stateless' code evolution based only on relevant parent context

Modeling

Base Model: GPT-4o and o1-preview (used in experiments)

Comparison to Prior Work

vs. H2O AutoML: AIDE searches in code space (Python scripts) rather than configuration space, allowing novel feature engineering and architectures.
vs. AutoGPT: AIDE uses a tree structure and summarization to manage context, whereas AutoGPT uses a linear log that fills up context windows.
vs. OpenHands: AIDE uses a specialized ML-focused search policy (Draft/Debug/Improve) rather than a generic software engineering loop, resulting in 4x more medals on MLE-Bench.

Limitations

Risk of data contamination since LLMs may have seen Kaggle competition data during pre-training.
Evaluation relies on holdout sets that may differ from official Kaggle private test sets.
Struggles in environments requiring large codebase modifications or multi-step interactions (e.g., Rust CodeContests).

Reproducibility

Code: https://github.com/WecoAI/aideml

Weco-Kaggle benchmark details are in Appendix C. MLE-Bench and RE-Bench are external benchmarks with their own repositories.

📊 Experiments & Results

Evaluation Setup

Autonomous participation in Machine Learning competitions and R&D tasks.

Benchmarks:

Weco-Kaggle Lite: 16 tabular ML competitions [New]
MLE-Bench: 75 Kaggle competitions (deep learning and tabular)
RE-Bench (METR): AI research and development tasks (e.g., kernel optimization)

Metrics:

Exceeds % of Human (Quantile Performance)
Medal Rate (Gold/Silver/Bronze)
Valid Submission Rate
Statistical methodology: Two-tailed t-test reported for MLE-Bench Lite comparisons (p < 0.01).
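The metrics above can be computed as in the following sketch. The function names, the tie-handling rule (strictly better counts as exceeding), and the example numbers are assumptions for illustration, not the benchmarks' official scoring code.

```python
from typing import List


def exceeds_percent_of_humans(
    agent_score: float, human_scores: List[float], higher_is_better: bool = True
) -> float:
    """Percentage of human leaderboard entries the agent strictly beats."""
    if higher_is_better:
        beaten = sum(1 for s in human_scores if agent_score > s)
    else:
        beaten = sum(1 for s in human_scores if agent_score < s)
    return 100.0 * beaten / len(human_scores)


def rate(flags: List[bool]) -> float:
    """Percentage of competitions where a flag is True (medal won, submission valid)."""
    return 100.0 * sum(flags) / len(flags)


# Made-up leaderboard and outcomes, for illustration only.
quantile = exceeds_percent_of_humans(0.80, [0.70, 0.75, 0.85, 0.90])
medal_rate = rate([True, False, False, True])
```

The `higher_is_better` switch matters because some competitions score by error (lower is better) and others by accuracy.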

Key Results

Weco-Kaggle Lite: AIDE compared with traditional AutoML (H2O AutoML) and a generic agent (AutoGPT) on tabular data.

Benchmark	Metric	Baseline	This Paper	Δ
Weco-Kaggle Lite	Exceeds % of Human	35.34 (H2O AutoML)	51.38	+16.04
Weco-Kaggle Lite	Exceeds % of Human	32.34 (AutoGPT)	51.38	+19.04

MLE-Bench Lite (from OpenAI's evaluation): o1-preview alone compared with AIDE using o1-preview.

Benchmark	Metric	Baseline	This Paper	Δ
MLE-Bench Lite	Any Medal Rate (%)	7.6	36.4	+28.8
MLE-Bench Lite	Gold Medal Rate (%)	6.1	21.2	+15.1
MLE-Bench Lite	Valid Submission Rate (%)	63.6	92.4	+28.8
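The Δ column is a difference in percentage points, while the "nearly 5x" figure quoted for the medal rate is a ratio. The short check below recomputes both from the MLE-Bench Lite rows of the table; the helper name `delta_and_ratio` is hypothetical.

```python
from typing import Tuple


def delta_and_ratio(baseline: float, value: float) -> Tuple[float, float]:
    """Return (difference in percentage points, multiplicative ratio)."""
    return round(value - baseline, 2), round(value / baseline, 2)


# (metric name, baseline, this paper), copied from the table above.
rows = [
    ("Any Medal Rate (%)", 7.6, 36.4),
    ("Gold Medal Rate (%)", 6.1, 21.2),
    ("Valid Submission Rate (%)", 63.6, 92.4),
]
for name, baseline, value in rows:
    delta, ratio = delta_and_ratio(baseline, value)
    print(f"{name}: {delta:+.1f} points, {ratio:.2f}x")
```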

Experiment Figures

Performance comparison on MLE-Bench Lite between o1-preview alone and AIDE with o1-preview across multiple metrics.

Performance over time on RE-Bench tasks compared to human experts.

Main Takeaways

AIDE effectively trades computational resources for performance by systematically exploring the code space, achieving SOTA results on MLE-Bench.
The tree-search approach prevents error propagation and context overflow common in linear agent workflows, enabling sustained improvement over long time windows (24 hours).
While dominant in isolated ML tasks (Kaggle), AIDE struggles with tasks requiring broad codebase navigation or multi-step logic changes (e.g., Rust CodeContests).
Surpasses human experts in specialized optimization tasks like writing Triton kernels, where rapid iteration in code space offers a superhuman advantage.

📚 Prerequisite Knowledge

Prerequisites

Machine Learning Engineering (pipelines, cross-validation)
Search Algorithms (Tree Search, Breadth-First/Depth-First)
Large Language Models (Prompting, Context Windows)

Key Terms

AutoML (Automated Machine Learning): Tools that automatically select models and hyperparameters (e.g., H2O, AutoSklearn), usually within a fixed search space.

ReAct (Reasoning and Acting): A paradigm in which agents interleave reasoning traces with task-specific actions.

POMDP (Partially Observable Markov Decision Process): A framework used by many agents in which the agent optimizes rewards based on a history of observations, which often leads to long context requirements.

Pass@k: A metric measuring the probability that at least one of k generated solutions is correct or achieves a certain threshold.
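One widely used way to estimate Pass@k from n generated samples, of which c are correct, is the combinatorial estimator 1 - C(n-c, k) / C(n, k), popularized in the code-generation evaluation literature. The sketch below implements it; whether the paper uses this exact estimator is not stated in this summary, and the function name `pass_at_k` is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 minus the probability that a random size-k subset has no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 samples and c = 1 correct, a single draw (k = 1) succeeds half the time, and `pass_at_k(2, 1, 1)` returns 0.5.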

Triton Kernel: A GPU kernel written in Triton, a language and compiler for writing highly efficient custom deep learning primitives.

Search Policy: A set of rules determining which node in the solution tree to expand next (e.g., prioritize debugging recent failures vs. improving the best solution).

Stateless Objective: An evaluation function that depends only on the current solution code, not on the history of how it was generated.