The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

📝 Paper Summary

Self-evolving · Agentic reasoning · Multi-call tool use with flexible plan

The AI Scientist-v2 is an autonomous agentic system that uses tree-search exploration and VLM feedback to generate scientific papers, achieving the first AI-generated acceptance at a machine learning workshop.

Core Problem

Previous automated science systems relied on human-authored code templates and linear, shallow experimentation, limiting their autonomy and ability to explore complex hypotheses deeply.

Why it matters:

Current AI research assistants still require significant human scaffolding (e.g., specific codebases) to function, limiting scalability
Linear experimentation fails to capture the iterative nature of science, where hypotheses must be refined, debugged, and expanded based on intermediate results
Demonstrating fully autonomous peer-review acceptance marks a critical milestone in AI's ability to contribute directly to human knowledge generation

Concrete Example: In v1, a human had to write a template for a specific topic (e.g., 'transformers') for the AI to modify. In v2, the system starts from a blank slate or generic prompt, downloads datasets, and writes all code from scratch, successfully debugging errors like 'tensor shape mismatch' via tree search.

Key Novelty

Agentic Tree Search for Automated Discovery

Replaces linear workflows with a tree search where nodes represent experimental states (code, results); the system expands promising nodes (refining ideas) and backtracks from errors (debugging)
Integrates an 'Experiment Progress Manager' that explicitly transitions through scientific stages: feasibility check → hyperparameter tuning → core agenda → ablation studies
Incorporates Vision-Language Models (VLMs) as critics to visually inspect generated plots during experiments and refine manuscript figures
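The refine-or-debug loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` fields, the `debug_prob` knob, and the toy `toy_propose` / `toy_evaluate` functions are all hypothetical stand-ins, and the rule for choosing between refining and debugging is an assumption made for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One experimental state: the code that was tried and how it fared."""
    code: str
    score: float
    buggy: bool
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def tree_search(
    root_code: str,
    propose: Callable[[Node], str],
    evaluate: Callable[[str], Tuple[float, bool]],
    max_steps: int = 20,
    debug_prob: float = 0.3,  # hypothetical knob, not a value from the paper
    seed: int = 0,
) -> Optional[Node]:
    rng = random.Random(seed)
    score, buggy = evaluate(root_code)
    nodes = [Node(root_code, score, buggy)]
    for _ in range(max_steps):
        failed = [n for n in nodes if n.buggy and not n.children]
        working = [n for n in nodes if not n.buggy]
        if failed and (not working or rng.random() < debug_prob):
            parent = rng.choice(failed)  # backtrack to a failed state and debug it
        else:
            parent = max(working, key=lambda n: n.score)  # refine the most promising state
        child_code = propose(parent)
        score, buggy = evaluate(child_code)
        child = Node(child_code, score, buggy, parent=parent)
        parent.children.append(child)
        nodes.append(child)
    working = [n for n in nodes if not n.buggy]
    return max(working, key=lambda n: n.score) if working else None


def toy_evaluate(code: str) -> Tuple[float, bool]:
    # Toy stand-in for running an experiment: each "x" is progress, a trailing "!" is a crash.
    return float(code.count("x")), code.endswith("!")


def toy_propose(node: Node) -> str:
    # Toy stand-in for the LLM: debug a crashed state, otherwise refine it.
    if node.buggy:
        return node.code[:-1]  # "debug": drop the crash marker
    new_code = node.code + "x"  # "refine": add one improvement
    return new_code + "!" if len(new_code) % 3 == 0 else new_code


best = tree_search("", toy_propose, toy_evaluate)
```

In a real system, `propose` would call an LLM and `evaluate` would execute the generated code; the toy functions only make the control flow observable.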

Evaluation Highlights

Achieved an average reviewer score of 6.33/10 at the ICLR 2025 'I Can't Believe It's Not Better' workshop
Ranked in the top 45% of all submissions to the workshop with scores of 6, 7, and 6
Passed peer review to become the first fully AI-generated manuscript accepted at a recognized machine learning venue (later withdrawn per protocol)
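As a quick arithmetic check, the reported average follows directly from the three reviewer scores listed above. The variable names are illustrative only.

```python
from statistics import mean

reviewer_scores = [6, 7, 6]  # scores reported for the accepted manuscript
average_score = round(mean(reviewer_scores), 2)
print(average_score)
```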

Breakthrough Assessment

9/10

While the science produced is workshop-level (not top-tier conference), the system architecture enabling fully autonomous, template-free discovery and successful peer review is a landmark technical and functional achievement.

⚙️ Technical Details

Problem Definition

Setting: End-to-end generation of a scientific manuscript M from a high-level topic prompt T, covering hypothesis H, experiments E, and analysis A.

Inputs: A high-level research topic (e.g., 'compositional generalization') or theme

Outputs: A complete, compiled PDF manuscript comprising abstract, introduction, method, experiments, and conclusion

Pipeline Flow

Idea Generation (Iterative brainstorming + Literature Search)
Experiment Progress Manager (Controls stages: 1. Preliminary → 2. Tuning → 3. Agenda → 4. Ablation)
Manuscript Writing (Text generation + VLM-based refinement)
Review & Compilation (LaTeX compilation)
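The four experiment stages listed above can be modeled as a simple state machine. The sketch below is an assumption-laden illustration, not the system's code: `StageManager`, `budget_per_stage`, and the rule "advance when the stage goal is met or the stage budget runs out" are hypothetical, since the summary does not state the actual transition criteria.

```python
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    PRELIMINARY = "preliminary investigation (feasibility check)"
    TUNING = "hyperparameter tuning"
    AGENDA = "core research agenda"
    ABLATION = "ablation studies"


ORDER: List[Stage] = [Stage.PRELIMINARY, Stage.TUNING, Stage.AGENDA, Stage.ABLATION]


class StageManager:
    """Hypothetical stage controller: tracks the current stage and a per-stage step budget."""

    def __init__(self, budget_per_stage: int) -> None:
        self.budget_per_stage = budget_per_stage
        self.index = 0
        self.steps_in_stage = 0

    @property
    def stage(self) -> Optional[Stage]:
        # None means all four stages are finished.
        return ORDER[self.index] if self.index < len(ORDER) else None

    def step(self, goal_met: bool) -> Optional[Stage]:
        """Record one search step and return the stage that applies to the next step."""
        if self.stage is None:
            return None
        self.steps_in_stage += 1
        # Assumed rule: move on when the goal is met or the budget is spent.
        if goal_met or self.steps_in_stage >= self.budget_per_stage:
            self.index += 1
            self.steps_in_stage = 0
        return self.stage


manager = StageManager(budget_per_stage=3)
outcomes = [False, True, False, False, False, True, True]
trace = [manager.step(ok) for ok in outcomes]
```

Here the tuning stage never meets its goal, so the manager advances only after its three-step budget is spent; every other stage advances as soon as `goal_met` is true.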

System Modules

Idea Generator

Formulate hypotheses and check novelty against existing literature

Model or implementation: Unspecified LLM (likely Claude 3.5 Sonnet or similar based on v1/context)

Experiment Progress Manager (Experimentation)

Orchestrates the transition between research stages (Feasibility, Tuning, Agenda, Ablation) and budget management

Model or implementation: Unspecified LLM

Tree Search Explorer (Experimentation)

Execute code generation, running, and debugging within a specific stage

Model or implementation: Unspecified LLM

VLM Critic

Evaluate generated plots for clarity, correctness, and aesthetics

Model or implementation: Unspecified VLM

Manuscript Author

Write the final paper based on aggregated results

Model or implementation: A reasoning model (e.g., o1) is mentioned for the reflection step

Novel Architectural Elements

Experiment Progress Manager acting as a state machine over the tree search (controlling the transition from Feasibility → Tuning → Agenda → Ablation)
Specialized Tree Nodes: Explicit distinction between 'Hyperparameter nodes', 'Ablation nodes', 'Replication nodes', and 'Aggregation nodes' within the search tree
VLM-in-the-loop for automated visual verification of experimental plots before they are accepted into the tree
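One way to picture the typed nodes and the plot gate is the sketch below. It is illustrative only: `Candidate`, `PlotReview`, `admit`, and `stub_critic` are hypothetical names, and the stub inspects a text description of a plot, whereas a real critic would send the rendered image to a VLM.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class NodeKind(Enum):
    HYPERPARAMETER = "hyperparameter"
    ABLATION = "ablation"
    REPLICATION = "replication"
    AGGREGATION = "aggregation"


@dataclass
class PlotReview:
    ok: bool
    feedback: str


@dataclass
class Candidate:
    kind: NodeKind
    plot_description: str  # stand-in for an image file in this toy example


def stub_critic(plot_description: str) -> PlotReview:
    # Placeholder for a real VLM call: only checks that basic plot elements are mentioned.
    missing = [part for part in ("axis labels", "legend") if part not in plot_description]
    if missing:
        return PlotReview(False, "missing: " + ", ".join(missing))
    return PlotReview(True, "plot looks complete")


def admit(
    tree: List[Candidate],
    candidate: Candidate,
    critic: Callable[[str], PlotReview],
) -> PlotReview:
    """Add the candidate to the tree only if the critic approves its plot."""
    review = critic(candidate.plot_description)
    if review.ok:
        tree.append(candidate)
    return review


tree: List[Candidate] = []
accepted = admit(tree, Candidate(NodeKind.ABLATION, "line plot with axis labels and legend"), stub_critic)
rejected = admit(tree, Candidate(NodeKind.HYPERPARAMETER, "line plot with legend"), stub_critic)
```

The rejected candidate's `feedback` string is what a search loop could pass back to the code generator as a repair hint.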

Modeling

Base Model: Not detailed in the main text (see Appendix A); the paper implies the use of state-of-the-art LLMs such as Claude, GPT-4o, and o1

Compute: Not reported in the paper

Comparison to Prior Work

vs. The AI Scientist-v1: Eliminates human code templates (autonomy), replaces linear execution with tree search (exploration depth), adds VLM feedback (visual quality)
vs. AIDE: Adapts tree search specifically for the multi-stage scientific method (hypothesis → ablation) rather than just maximizing a leaderboard metric
vs. Co-Scientist [not cited in paper]: Focuses on ML code/theory generation rather than chemistry/lab automation

Limitations

Inconsistent quality: Only 1 of 3 submitted papers was accepted; system does not consistently reach top-tier conference standards.
Incremental science: Generated ideas are often standard ML variations rather than deep, novel theoretical breakthroughs.
Hallucination: Occasionally introduces inaccuracies in citations or method descriptions (e.g., confusion between embedding vs. hidden states).
Template-free fragility: Although the system needs no code template, it relies on ad-hoc Hugging Face dataset loading, which may not generalize to all data types.

Reproducibility

Code: https://github.com/SakanaAI/AI-Scientist-v2

The codebase is open source and publicly available at the repository above. The data from the ICLR 2025 workshop experiment is also released. The LLM endpoints are configurable, and defaults are provided in the repo.

📊 Experiments & Results

Evaluation Setup

Submission of fully AI-generated manuscripts to the peer-reviewed ICLR 2025 workshop 'I Can't Believe It's Not Better' (ICBINB).

Benchmarks:

ICLR 2025 ICBINB Workshop Peer Review (Scientific Peer Review)

Metrics:

Reviewer Score (1-10)
Acceptance Decision
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ICLR 2025 ICBINB Workshop Peer Review	Reviewer Score	Not reported in the paper	6.33	Not reported in the paper

Main Takeaways

The system successfully produced a workshop-accepted paper on 'Compositional Regularization' that received scores of 6, 7, and 6.
Reviewers valued the negative results and clear identification of challenges, despite noting shortcomings in theoretical justification.
The system autonomously identified a relevant hypothesis (penalizing embedding deviations), executed experiments on synthetic data (SCAN/COGS), and reported that the method failed to improve generalization (a valid scientific negative result).
Limitations in the generated paper included confusion in terminology (embedding vs. hidden states) and insufficient experimental breadth (only LSTM tested).

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Model (LLM) agents
Understanding of tree search algorithms (e.g., Best-First Search)
Basic knowledge of the machine learning research lifecycle (idea, code, experiment, paper)

Key Terms

Agentic Tree Search: An exploration strategy where an AI agent builds a tree of experimental states, choosing to refine successful branches or debug failed ones based on evaluation scores

VLM: Vision-Language Model—an AI model capable of understanding and generating text based on visual inputs (images)

Experiment Progress Manager: A meta-agent that governs the high-level phases of research (feasibility, tuning, core agenda, ablation) to ensure structural rigor

Ablation Studies: Experiments that remove specific components of a method to evaluate their individual contributions to performance

Semantic Scholar: A scientific literature search engine used by the system to check idea novelty and find citations

ICLR: International Conference on Learning Representations—a top-tier machine learning conference

Hugging Face Hub: A platform hosting datasets and models, used by the system to autonomously download research data

OOM: Out Of Memory—a common error in deep learning training which the system must debug

Aider: A command-line tool for AI-assisted coding, used in the previous version (v1) but replaced in v2 by a single-pass generation approach