MLE-bench: A benchmark for evaluating AI agents on Machine Learning Engineering tasks sourced from 75 real-world Kaggle competitions
AIDE: An existing state-of-the-art LLM-based agent that uses a tree-search approach to generate and refine ML code
MCTS: Monte Carlo Tree Search, a heuristic search algorithm that balances exploration (trying new paths) and exploitation (refining promising paths) by running repeated simulations over a search tree and updating each node's value estimate
Crossover: A genetic operator that combines parts of two different parent solutions (codebases) to create a new offspring solution
Generalization gap: The difference in model performance between the validation set (used for tuning) and the held-out test set (used for final scoring)
AIRA-dojo: The authors' proposed framework providing isolated, reproducible environments (sandboxes) for executing and evaluating research agents
UCT: Upper Confidence bounds applied to Trees, the selection rule used in MCTS, which picks the child node that maximizes an upper confidence bound on its estimated reward
Medal Success Rate: The percentage of tasks where an agent achieves a score equivalent to a Bronze, Silver, or Gold medal in the original Kaggle competition
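
To make the UCT entry above concrete, here is a minimal sketch of the textbook UCT rule: mean reward plus an exploration bonus of c * sqrt(ln(parent visits) / child visits). This is the standard formulation, not necessarily the exact variant used by the authors. The names `uct_score` and `select_child`, the dictionary fields, and the default constant c = sqrt(2) are illustrative assumptions.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Textbook UCT value for one child node (hypothetical helper)."""
    if visits == 0:
        # Unvisited children are tried first by convention.
        return math.inf
    exploit = total_reward / visits  # mean reward observed so far
    explore = c * math.sqrt(math.log(parent_visits) / visits)  # uncertainty bonus
    return exploit + explore


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Return the child with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["total_reward"], ch["visits"], parent_visits, c),
    )


# Illustrative data only: two visited children and one unvisited child.
children = [
    {"name": "a", "total_reward": 6.0, "visits": 10},
    {"name": "b", "total_reward": 1.5, "visits": 3},
    {"name": "c", "total_reward": 0.0, "visits": 0},
]

if __name__ == "__main__":
    print(select_child(children, parent_visits=13)["name"])
```

Setting c to 0 reduces the rule to pure exploitation (highest mean reward wins), while larger values of c favor rarely visited nodes.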