
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, R. Lange, Cong Lu, Shengran Hu, Chris Lu, J. Foerster, Jeff Clune, David Ha
Sakana AI, University of British Columbia, Vector Institute, FLAIR, University of Oxford, Canada CIFAR AI Chair
arXiv.org (2025)
Agent Reasoning MM Benchmark

📝 Paper Summary

Self-evolving · Agentic reasoning · Multi-call tool use with flexible plan
The AI Scientist-v2 is an autonomous agentic system that uses tree-search exploration and VLM feedback to generate scientific papers. It produced the first fully AI-generated manuscript to pass peer review at a machine learning workshop.
Core Problem
Previous automated science systems relied on human-authored code templates and linear, shallow experimentation, limiting their autonomy and ability to explore complex hypotheses deeply.
Why it matters:
  • Current AI research assistants still require significant human scaffolding (e.g., specific codebases) to function, limiting scalability
  • Linear experimentation fails to capture the iterative nature of science, where hypotheses must be refined, debugged, and expanded based on intermediate results
  • A fully AI-generated paper passing peer review marks a critical milestone in AI's ability to contribute directly to human knowledge generation
Concrete Example: In v1, a human had to write a code template for a specific topic (e.g., 'transformers') for the AI to modify. In v2, the system starts from a blank slate or a generic prompt, downloads datasets, and writes all code from scratch, using tree search to debug errors such as 'tensor shape mismatch'.
Key Novelty
Agentic Tree Search for Automated Discovery
  • Replaces linear workflows with a tree search where nodes represent experimental states (code, results); the system expands promising nodes (refining ideas) and backtracks from errors (debugging)
  • Integrates an 'Experiment Progress Manager' that explicitly transitions through scientific stages: feasibility check → hyperparameter tuning → core agenda → ablation studies
  • Incorporates Vision-Language Models (VLMs) as critics to visually inspect generated plots during experiments and refine manuscript figures
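The first two points can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `tree_search`, and `run_stages` are hypothetical, as is the selection rule (debug a failed leaf first, otherwise refine the best-scoring node). In the real system `propose` and `run` would be an LLM call and a sandboxed experiment; here they are toy stand-ins so the control flow can be run end to end.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Stage names taken from the summary; everything else below is an assumption.
STAGES = ["feasibility check", "hyperparameter tuning", "core agenda", "ablation studies"]


@dataclass
class Node:
    """One experimental state: the code that was run and the score it achieved."""
    code: str
    score: Optional[float] = None          # None marks a failed (buggy) run
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    @property
    def is_buggy(self) -> bool:
        return self.score is None


def tree_search(root_code: str,
                propose: Callable[[Node], str],
                run: Callable[[str], Optional[float]],
                steps: int) -> Optional[Node]:
    """Expand the tree for `steps` iterations and return the best working node."""
    root = Node(code=root_code, score=run(root_code))
    nodes = [root]
    for _ in range(steps):
        buggy_leaves = [n for n in nodes if n.is_buggy and not n.children]
        working = [n for n in nodes if not n.is_buggy]
        if buggy_leaves:
            parent = buggy_leaves[0]                         # debugging branch
        elif working:
            parent = max(working, key=lambda n: n.score)     # refinement branch
        else:
            break
        child_code = propose(parent)
        child = Node(code=child_code, score=run(child_code), parent=parent)
        parent.children.append(child)
        nodes.append(child)
    working = [n for n in nodes if not n.is_buggy]
    return max(working, key=lambda n: n.score) if working else None


def run_stages(root_code: str,
               propose: Callable[[Node], str],
               run: Callable[[str], Optional[float]],
               steps_per_stage: int) -> List[Tuple[str, str]]:
    """Run one tree search per stage, seeding each stage with the previous best code."""
    history = []
    current = root_code
    for stage in STAGES:
        best = tree_search(current, propose, run, steps_per_stage)
        if best is None:        # no working node: stop instead of advancing
            break
        current = best.code
        history.append((stage, current))
    return history


# Toy stand-ins: "code" is a string, a run fails while it contains "BUG",
# and a longer string counts as a better result.
def toy_run(code: str) -> Optional[float]:
    return None if "BUG" in code else float(len(code))


def toy_propose(node: Node) -> str:
    return node.code.replace("BUG", "") if node.is_buggy else node.code + "+"


if __name__ == "__main__":
    for stage, code in run_stages("BUG", toy_propose, toy_run, steps_per_stage=3):
        print(f"{stage}: {code!r}")
```

Keeping failed nodes in the tree, rather than discarding them, is what lets the same loop handle both refinement and debugging: a failed leaf simply becomes the next node to expand.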
Evaluation Highlights
  • Achieved an average reviewer score of 6.33/10 at the ICLR 2025 'I Can't Believe It's Not Better' workshop
  • Received individual reviewer scores of 6, 7, and 6, ranking in the top 45% of all submissions to the workshop
  • Passed peer review to become the first fully AI-generated manuscript accepted at a recognized machine learning venue (later withdrawn per protocol)
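As a quick arithmetic check, the reported average follows directly from the three individual scores. The variable names below are illustrative only.

```python
from statistics import mean

# Individual reviewer scores as reported in the summary.
scores = [6, 7, 6]

# The mean is 19/3, which rounds to the reported 6.33.
average = mean(scores)
print(f"{average:.2f}")
```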
Breakthrough Assessment
9/10
While the science produced is workshop-level (not top-tier conference), the system architecture enabling fully autonomous, template-free discovery and successful peer review is a landmark technical and functional achievement.