
AIDE: AI-Driven Exploration in the Space of Code

Z. Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, Yuxiang Wu
Weco AI
arXiv.org (2025)
Agent · Benchmark · Reasoning

📝 Paper Summary

Self-evolving · Agentic reasoning · Multi-call tool use with flexible plan
AIDE automates machine learning engineering by modeling the trial-and-error process as a tree search over Python scripts, using LLMs to iteratively draft, debug, and improve code based on performance feedback.
Core Problem
Developing high-performance ML models requires labor-intensive trial and error (debugging, tuning, architectural changes). Standard agents struggle to manage this process because their context grows with every attempt and their exploration is unstructured.
Why it matters:
  • Traditional AutoML is limited to predefined configuration spaces (hyperparameters) and cannot innovate on code structure or data processing like human engineers.
  • ReAct-style agents simply append each step to the context, which leads to context overflow and incoherent optimization attempts over long horizons.
  • Human engineers spend significant time on tedious iteration rather than conceptualizing high-level research hypotheses.
Concrete Example: In a Kaggle competition, a standard agent might generate a script, encounter a tensor shape mismatch, and fail to fix it because the error log is lost in a long conversation history. AIDE would treat the buggy script as a node in a tree, specifically target it with a 'Debug' operation using only relevant logs, and spawn a corrected child node.
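The contrast in this example can be made concrete with a minimal sketch of a node-scoped debug prompt. The function name `build_debug_prompt`, its parameters, and the prompt wording are hypothetical and not taken from the paper; the only point illustrated is that the prompt is built from one node's script and the tail of its own log rather than from the whole conversation.

```python
def build_debug_prompt(task: str, buggy_code: str, error_log: str,
                       max_log_chars: int = 2000) -> str:
    """Assemble a prompt from a single node: the task, its script, and its log tail.

    Hypothetical helper for illustration; AIDE's real prompts may differ.
    """
    # A Python traceback ends with the actual exception, so keep the tail.
    log_tail = error_log[-max_log_chars:]
    return (
        f"Task:\n{task}\n\n"
        f"Script that failed:\n{buggy_code}\n\n"
        f"Execution output (last {len(log_tail)} characters):\n{log_tail}\n\n"
        "Fix the bug and return the complete corrected script."
    )
```

Because the prompt size is bounded by the script length plus `max_log_chars`, it does not grow with the number of earlier attempts, which is the property the example relies on.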
Key Novelty
Tree Search in the Space of Code (AIDE)
  • Frames ML engineering as a discrete optimization problem where every state is a standalone Python script and actions are code edits (Draft, Debug, Improve).
  • Replaces monolithic conversation history with a solution tree, using a 'Summarization Operator' to compress past attempts into concise hints for the next step, keeping prompts short and focused.
  • Decouples the search policy (which solution to work on next) from the coding capability (how to write the code), allowing for systematic exploration rather than linear greedy steps.
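The three points above can be sketched as a small search loop. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `Node`, `summarize`, `select`, and `search` are hypothetical, the policy is deterministic (draft until `num_drafts` roots exist, then debug any buggy leaf, otherwise improve the best node), and a higher metric is assumed to be better. The `coder` callable stands in for the LLM and `evaluate` for script execution.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One standalone script in the solution tree (hypothetical structure)."""
    code: str
    metric: Optional[float] = None      # validation score; None if the run failed
    is_buggy: bool = False
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def summarize(nodes: List[Node], limit: int = 3) -> str:
    """Compress past attempts into a short hint instead of a full history."""
    scored = sorted((n for n in nodes if n.metric is not None),
                    key=lambda n: n.metric, reverse=True)
    return "; ".join(f"metric={n.metric:.3f}" for n in scored[:limit])


def select(nodes: List[Node], num_drafts: int) -> Tuple[str, Optional[Node]]:
    """Search policy: decide which operation to apply to which node."""
    roots = [n for n in nodes if n.parent is None]
    if len(roots) < num_drafts:
        return "draft", None
    buggy_leaves = [n for n in nodes if n.is_buggy and not n.children]
    if buggy_leaves:
        return "debug", buggy_leaves[0]
    good = [n for n in nodes if not n.is_buggy and n.metric is not None]
    if good:
        return "improve", max(good, key=lambda n: n.metric)
    return "draft", None


def search(coder: Callable[[str, Optional[Node], str], str],
           evaluate: Callable[[str], Tuple[Optional[float], bool]],
           steps: int = 10, num_drafts: int = 3) -> Optional[Node]:
    """Grow the tree for a fixed number of steps and return the best working node."""
    nodes: List[Node] = []
    for _ in range(steps):
        op, target = select(nodes, num_drafts)
        code = coder(op, target, summarize(nodes))   # coding capability
        metric, is_buggy = evaluate(code)            # run the script, read feedback
        child = Node(code=code, metric=metric, is_buggy=is_buggy, parent=target)
        if target is not None:
            target.children.append(child)
        nodes.append(child)
    good = [n for n in nodes if not n.is_buggy and n.metric is not None]
    return max(good, key=lambda n: n.metric) if good else None
```

Note that `select` never calls the model and `coder` never sees the tree, only a target node and a summary string. That separation is what lets the policy be swapped without touching the code-writing step.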
Evaluation Highlights
  • Achieves 36.4% medal rate on MLE-Bench Lite with o1-preview, a nearly 5x improvement over the model alone (7.6%) and significantly outperforming OpenHands.
  • Surpasses 51.38% of human competitors on average across 16 tabular Kaggle competitions, outperforming H2O AutoML (35.34%) and AutoGPT (32.34%).
  • Outperformed 9 human experts in the 'Optimize a Kernel' task (Triton kernel optimization) on RE-Bench by discovering a solution faster than any human.
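As a quick arithmetic check on the figures quoted above, the ratio behind "nearly 5x" and the gaps to the two baselines can be recomputed directly. The variable names are illustrative; the numbers are the ones stated in the highlights.

```python
# Medal rates on MLE-Bench Lite, in percent, as quoted above.
aide_medal_rate = 36.4
model_alone_medal_rate = 7.6
improvement_ratio = round(aide_medal_rate / model_alone_medal_rate, 2)

# Share of human competitors surpassed on the 16 tabular competitions, in percent.
aide_pct, h2o_pct, autogpt_pct = 51.38, 35.34, 32.34
gap_vs_h2o = round(aide_pct - h2o_pct, 2)          # percentage points
gap_vs_autogpt = round(aide_pct - autogpt_pct, 2)  # percentage points

print(improvement_ratio, gap_vs_h2o, gap_vs_autogpt)
```

The ratio comes out at about 4.79, consistent with "nearly 5x", and the leads over H2O AutoML and AutoGPT are about 16 and 19 percentage points respectively.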
Breakthrough Assessment
9/10
AIDE demonstrates a step change in automated engineering, moving from parameter tuning to actual code evolution. Its dominance on MLE-Bench and its ability to beat humans on complex coding tasks (Triton kernels) mark a significant advance in agentic coding.