From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery

📝 Paper Summary

Autonomous scientific discovery, Agentic AI, Scientific Large Language Models (Sci-LLMs)

The paper formalizes 'Agentic Science' as a distinct evolutionary stage of AI for Science, proposing a unified framework that connects foundational capabilities (reasoning, tools, memory) to autonomous discovery workflows across natural science domains.

Core Problem

Existing research on autonomous scientific discovery is fragmented, treating LLM capabilities, research processes, and autonomy levels in isolation without a unified framework.

Why it matters:

AI is shifting from passive computational tools to active research partners, but the lack of a structured paradigm hinders the systematic design of these agents
Current surveys focus only on one aspect (e.g., just the process or just the tools), missing the holistic connection between core cognitive capabilities and domain-specific realizations
Rapid progress in separate fields (biology, physics, etc.) needs synthesis to identify common challenges like reproducibility and human-agent collaboration

Concrete Example: A traditional AI model might predict a protein structure (Level 1), but it cannot independently hypothesize why that structure matters, design a wet-lab experiment to test it, or refine the hypothesis based on the results. Those capabilities are what true 'Agentic Science' (Level 3) requires.

Key Novelty

Unified Three-Layer Framework for Agentic Science

Formalizes the evolution of AI for Science into distinct levels: from Computational Oracles (tools) to Automated Assistants (partial autonomy) to Autonomous Partners (full agency), with Generative Architects as a prospective fourth level
Proposes a 'Comprehensive Framework' connecting three layers: (1) Foundational Capabilities (reasoning, memory), (2) Core Processes (hypothesis, experiment), and (3) Domain Realizations
Integrates previously fragmented perspectives (process-oriented, autonomy-oriented, mechanism-oriented) into a single domain-oriented review structure

Architecture

The Comprehensive Framework of Agentic Science, connecting Foundational Capabilities (bottom), Core Processes (middle), and Domain Realizations (top)

Evaluation Highlights

Review spans 4 major domains (Life Sciences, Chemistry, Materials, Physics) and over a dozen subfields
Identifies 5 core capabilities: Reasoning/Planning, Tool Integration, Memory, Multi-Agent Collaboration, and Optimization/Evolution
Categorizes existing systems by autonomy level, separating Level 2 (Automated Assistants) from Level 3 (Autonomous Partners); Coscientist and ChemCrow are among the systems discussed

Breakthrough Assessment

9/10

This is a foundational survey that defines the lexicon and structure for the emerging field of Agentic Science. It unifies scattered developments into a coherent paradigm.

⚙️ Technical Details

Problem Definition

Setting: Autonomous scientific discovery where an agent interacts with a scientific environment to maximize knowledge gain

Inputs: High-level research goal G, initial knowledge base K, available toolset T

Outputs: Refined hypotheses H, experimental evidence E, and updated knowledge K'

Pipeline Flow

Observation & Hypothesis Generation (Agent proposes theory based on data)
Experimental Planning & Execution (Agent designs/runs tests via tools)
Data & Result Analysis (Agent interprets outcomes)
Synthesis, Validation & Evolution (Agent refines theory/knowledge)
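
The four stages above, together with the inputs (G, K, T) and outputs (H, E, K') from the problem definition, can be sketched as a simple loop. This is a minimal illustration, not the survey's implementation: the names DiscoveryState, run_discovery, propose, plan, and analyze are hypothetical, and the toy "law discovery" task is invented purely to make the loop runnable.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class DiscoveryState:
    """Hypothetical container for the quantities named in the problem definition."""
    goal: str                                             # G: high-level research goal
    knowledge: List[str]                                  # K, updated in place toward K'
    hypotheses: List[str] = field(default_factory=list)   # H
    evidence: List[dict] = field(default_factory=list)    # E


def run_discovery(goal: str, knowledge: List[str], tools: Dict[str, Callable[..., Any]],
                  propose, plan, analyze, max_rounds: int = 3) -> DiscoveryState:
    state = DiscoveryState(goal=goal, knowledge=list(knowledge))
    for _ in range(max_rounds):
        # 1. Observation & hypothesis generation
        hypothesis = propose(state.goal, state.knowledge)
        state.hypotheses.append(hypothesis)
        # 2. Experimental planning & execution (via the toolset T)
        tool_name, args = plan(hypothesis, tools)
        result = tools[tool_name](**args)
        # 3. Data & result analysis
        supported = analyze(hypothesis, result)
        state.evidence.append(
            {"hypothesis": hypothesis, "result": result, "supported": supported}
        )
        # 4. Synthesis, validation & evolution: fold the outcome back into knowledge
        verdict = "supported" if supported else "refuted"
        state.knowledge.append(f"{hypothesis}: {verdict}")
        if supported:
            break
    return state


# Toy stand-ins (assumptions): a real agent would back these with an LLM and lab tools.
CANDIDATES = ["y = 2x", "y = x^2"]
MODELS = {"y = 2x": lambda x: 2 * x, "y = x^2": lambda x: x * x}


def propose(goal, knowledge):
    tested = " ".join(knowledge)
    return next(c for c in CANDIDATES if c not in tested)


def plan(hypothesis, tools):
    return "measure", {"x": 3}


def analyze(hypothesis, result):
    return MODELS[hypothesis](3) == result


tools = {"measure": lambda x: x * x}   # the hidden "law" the agent must find
final = run_discovery("Find the law relating x and y", [], tools, propose, plan, analyze)
print(final.hypotheses)   # hypotheses tried, in order
print(final.knowledge)    # updated knowledge K'
```

The loop stops as soon as a hypothesis is supported; a more realistic agent would keep refining and would need a stopping criterion tied to its research goal.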

System Modules

Reasoning & Planning Engine (Foundational Capabilities)

Decompose complex scientific problems into executable steps

Model or implementation: LLM (e.g., GPT-4, Claude) or specialized Sci-LLM

Tool Integration Interface (Foundational Capabilities)

Connect LLM reasoning to domain-specific executables (simulators, wet labs)

Model or implementation: API wrappers / Function calling
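
As a rough sketch of what an API-wrapper or function-calling layer could look like, the snippet below registers a Python function as a tool and dispatches a JSON-encoded call to it. The names TOOLS, register_tool, dispatch, and molecular_weight are hypothetical, the JSON shape is an assumption rather than any specific vendor's format, and the atomic masses are rounded standard values.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to plain Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def register_tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that exposes a function to the agent under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@register_tool
def molecular_weight(formula_counts: Dict[str, int]) -> float:
    """Toy domain tool: sum rounded atomic masses (g/mol) for an element count map."""
    masses = {"H": 1.008, "C": 12.011, "O": 15.999}
    return sum(masses[element] * count for element, count in formula_counts.items())


def dispatch(call_json: str) -> Any:
    """Parse a model-emitted call (assumed shape: name + arguments) and run the tool."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# A call such as an LLM might emit for water (H2O); the result is roughly 18.015.
result = dispatch(
    '{"name": "molecular_weight", "arguments": {"formula_counts": {"H": 2, "O": 1}}}'
)
print(round(result, 3))
```

Keeping the registry separate from the dispatcher means unknown tool names fail loudly instead of being silently ignored, which matters when the caller is a model that may invent tool names.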

Memory Mechanism (Foundational Capabilities)

Store and retrieve long-term scientific knowledge and past experiment results

Model or implementation: Vector Database / Knowledge Graph
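
To make the retrieval idea concrete, here is a minimal in-memory stand-in for a vector store. It is a sketch under strong assumptions: bag-of-words counts replace learned embeddings, the class name ExperimentMemory is hypothetical, and the stored records are made-up examples rather than results from any surveyed system.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ExperimentMemory:
    """Stores past experiment notes and returns the ones most similar to a query."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, Counter]] = []

    def store(self, text: str) -> None:
        self._records.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self._records, key=lambda rec: cosine(q, rec[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = ExperimentMemory()
# Made-up records, for illustration only.
memory.store("coupling reaction yield improved with palladium catalyst")
memory.store("folding simulation diverged at high temperature")
hits = memory.retrieve("which catalyst improved the coupling yield")
print(hits)
```

A knowledge-graph memory would replace the similarity ranking with explicit entity and relation lookups; the store/retrieve interface could stay the same.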

Novel Architectural Elements

The 'Comprehensive Framework' unifies capabilities, processes, and domains into a single hierarchical model for Agentic Science
Formalization of the 4-level evolution of AI for Science (Oracle -> Assistant -> Partner -> Architect)

Modeling

Base Model: Various (survey covers many systems including GPT-4, LLaMA-based Sci-LLMs, Claude, etc.)

Training Method: Survey covers various methods: Continued Pre-training, Supervised Fine-Tuning (SFT), Reinforcement Learning (RL)

Objective Functions:

Purpose: Minimize prediction error on scientific tasks.
Formally: Minimize empirical risk L_task over dataset D.

Purpose: Maximize cumulative scientific utility (Level 3 agents).
Formally: Maximize expected information gain I(.) about hypotheses H over an infinite horizon.

Purpose: Generate new frameworks (Level 4 agents).
Formally: Maximize generative potential Phi of new framework f_new.
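
One possible way to write the three objectives above in symbols is given below. Only L_task, D, I(.), H, Phi, and f_new come from the summary; the model parameters theta, the policy pi, the observations o_t, and the discount factor gamma (added so the infinite sum stays finite) are assumptions introduced here and may differ from the paper's notation.

```latex
\begin{align}
\text{Prediction:}\quad
  & \min_{\theta}\; \frac{1}{|D|} \sum_{(x,\,y) \in D}
    L_{\text{task}}\big(f_{\theta}(x),\, y\big) \\
\text{Level 3:}\quad
  & \max_{\pi}\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\,
    I\big(H;\, o_{t} \mid o_{<t}\big) \right],
    \qquad 0 < \gamma < 1 \\
\text{Level 4:}\quad
  & \max_{f_{\text{new}}}\; \Phi\big(f_{\text{new}}\big)
\end{align}
```

Read this way, the first objective fits a fixed predictor to data, the second chooses actions (experiments) for what they reveal about the hypotheses, and the third searches over frameworks themselves.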

Adaptation: Domain-specific adaptation (e.g., ChemLLM, Darwin Series)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Process-oriented: Integrates cognitive capabilities and domain specifics, not just workflow steps
vs. Autonomy-oriented: Provides deep domain-specific categorization (Life Sci, Chem, etc.) rather than just generic autonomy scales
vs. Mechanism-oriented: Connects mechanisms to specific scientific discovery outcomes and the 'Agentic Science' evolution paradigm

Limitations

Reproducibility crisis: Many agentic systems rely on closed-source models (e.g., GPT-4) or proprietary labs
Hallucination risks: Scientific agents may generate plausible but incorrect hypotheses or citations
Safety and Ethics: Autonomous experimentation (especially in bio/chem) poses dual-use risks (e.g., pathogen design)
Evaluation difficulties: Lack of standardized benchmarks for open-ended scientific discovery

Reproducibility

Code: https://github.com/AgenticScience/Awesome-Agent-Scientists

The paper is a survey, so the repository above is a curated list of resources rather than an implementation. Reproducibility of the individual surveyed papers varies.

📊 Experiments & Results

Evaluation Setup

Qualitative survey and synthesis of existing literature across four major scientific domains

Benchmarks:

N/A (survey paper; literature review)

Metrics:

Statistical methodology: Not explicitly reported in the paper

Experiment Figures

The evolution of AI for Science through four levels: Computational Oracle -> Automated Assistant -> Autonomous Partner -> Generative Architect

Main Takeaways

AI for Science is transitioning from 'Tool' (Level 1) to 'Partner' (Level 3), driven by LLMs and agentic architectures
Five core capabilities define these agents: Reasoning, Tools, Memory, Collaboration, and Evolution
Applications are rapidly expanding: from autonomous chemical synthesis (Coscientist) to therapeutic target discovery (OriGene)
Future success requires addressing 'Safety' (dual-use), 'Reliability' (hallucinations), and 'Human-Agent Collaboration' (trust)

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and their capabilities (reasoning, tool use)
Familiarity with the scientific method (hypothesis, experiment, analysis)
Knowledge of AI applications in specific scientific domains (e.g., AlphaFold for biology)

Key Terms

Sci-LLMs: Scientific Large Language Models—models adapted for science via fine-tuning on scientific data or integration with scientific tools

Agentic Science: A stage of AI where systems act as autonomous partners capable of the full discovery cycle (hypothesis to analysis) with minimal human guidance

Level 1 (Computational Oracle): AI as a specialized tool/function approximator for specific tasks (e.g., protein folding prediction) without autonomy

Level 2 (Automated Research Assistant): AI that executes pre-defined sub-goals or workflows (e.g., running a simulation pipeline) but lacks high-level strategic direction

Level 3 (Autonomous Scientific Partner): AI that independently conducts the discovery loop, including hypothesis generation and experimental design

Level 4 (Generative Architect): Hypothetical future AI capable of inventing new scientific frameworks, instruments, or paradigms

RAG: Retrieval-Augmented Generation—enhancing model outputs by retrieving relevant information from external knowledge bases

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps

SMILES: Simplified Molecular Input Line Entry System—a text notation for representing chemical structures

PDEs: Partial Differential Equations—mathematical equations describing continuous physical phenomena like fluid dynamics