BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments

📝 Paper Summary

AI for Scientific Discovery · Biological Experiment Design

BioDiscoveryAgent uses Large Language Models and external tools to design iterative genetic perturbation experiments. It outperforms Bayesian optimization by drawing on biological prior knowledge and reasoning over experimental results.

Core Problem

Identifying drug targets via genetic perturbation screens is costly because the search space of genes (approx. 19,000) and combinations is vast, while only a small subset yields the desired phenotype.
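To make the scale concrete, the arithmetic below is a rough sketch. It takes the approximate 19,000-gene figure quoted above at face value and borrows the 5-round, 128-genes-per-round budget from the evaluation setup described later in this summary; real screens may cover fewer genes.

```python
import math

N_GENES = 19_000  # approximate figure quoted in this summary

# Number of unordered 2-gene combinations that could be perturbed.
n_pairs = math.comb(N_GENES, 2)


def budget_fraction(rounds: int, batch_size: int, space_size: int) -> float:
    """Fraction of the search space covered by a fixed experimental budget."""
    return rounds * batch_size / space_size


if __name__ == "__main__":
    print(f"single-gene candidates: {N_GENES:,}")
    print(f"2-gene combinations:    {n_pairs:,}")
    print(f"coverage of 5 x 128 single-gene tests: "
          f"{budget_fraction(5, 128, N_GENES):.2%}")
```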

Why it matters:

Experimentally perturbing every gene is prohibitively expensive and time-consuming
Existing Bayesian optimization methods require training bespoke, opaque models on small datasets and cannot leverage the vast biological knowledge in scientific literature
Misidentification of drug targets is a major cause of clinical trial failure

Concrete Example: A perturbation screen typically targets ~19,000 genes, but only a handful may affect cell growth (the phenotype). Testing all of them is inefficient; existing ML methods struggle with the 'cold start' problem before data is collected, whereas human experts use literature knowledge to pick initial targets.

Key Novelty

LLM-driven Closed-Loop Experiment Design Agent

Replaces specialized acquisition functions (like in Bayesian Optimization) with an LLM that directly suggests genes to perturb based on prompts containing task descriptions and prior results
Integrates tool use (literature search, database queries, AI critic) to ground predictions in existing biological knowledge rather than just statistical patterns

Architecture

[Figure: The workflow of BioDiscoveryAgent, involving the User, the Agent (LLM), and Tools.]

Evaluation Highlights

+21% improvement in predicting relevant genetic perturbations (hits) across six datasets compared to Bayesian optimization baselines
+46% improvement in the harder task of identifying non-essential gene hits, which are biologically more informative than essential genes
+170% improvement over random baselines in the novel task of predicting 2-gene combinatorial perturbations

Breakthrough Assessment

8/10

Significant advance in applying agents to real-world scientific discovery. Demonstrates that general-purpose LLMs can outperform specialized ML models in experimental design by leveraging semantic knowledge, not just numerical optimization.

⚙️ Technical Details

Problem Definition

Setting: Closed-loop experimental design where an agent selects a batch of genes to perturb at each round to maximize cumulative discovered hits

Inputs: Task description, biological hypothesis, results from previous experimental rounds (genes perturbed and their phenotypic response)

Outputs: A list of genes (or gene pairs) to perturb in the next experimental round

Pipeline Flow

Prompt Construction (Task + History) → Tool Use (Optional) → LLM Inference → Gene Selection
Experimental Simulation (Retrieve ground truth) → Update History → Repeat
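The loop above can be sketched as a small simulation harness. This is a minimal illustration under stated assumptions, not the paper's code: `run_closed_loop`, `random_selector`, the history record layout, and the "score above threshold" hit rule are all hypothetical names and choices. The `select_genes` callable stands in for the prompt construction, optional tool use, and LLM inference steps.

```python
import random
from typing import Callable, Dict, List, Set

# A selector receives (history, remaining_genes, batch_size) and returns gene names.
Selector = Callable[[List[dict], List[str], int], List[str]]


def run_closed_loop(
    all_genes: List[str],
    ground_truth: Dict[str, float],
    select_genes: Selector,
    hit_threshold: float,
    rounds: int = 5,
    batch_size: int = 128,
) -> List[dict]:
    """Simulate iterative screening by looking up retrospective scores."""
    history: List[dict] = []
    tested: Set[str] = set()
    for round_idx in range(1, rounds + 1):
        remaining = [g for g in all_genes if g not in tested]
        if not remaining:
            break
        proposed = select_genes(history, remaining, batch_size)
        # Keep only known, untested, non-duplicate genes, capped at the batch size.
        batch: List[str] = []
        for gene in proposed:
            if gene in ground_truth and gene not in tested and gene not in batch:
                batch.append(gene)
            if len(batch) == batch_size:
                break
        # "Experimental simulation": retrieve the measured phenotype scores.
        scores = {g: ground_truth[g] for g in batch}
        # Assumption: a hit is a score strictly above the threshold.
        hits = [g for g, s in scores.items() if s > hit_threshold]
        tested.update(batch)
        history.append({"round": round_idx, "scores": scores, "hits": hits})
    return history


def random_selector(history: List[dict], remaining: List[str], batch_size: int) -> List[str]:
    """Random baseline: ignores history and samples untested genes uniformly."""
    return random.sample(remaining, min(batch_size, len(remaining)))
```

Swapping `random_selector` for a function that calls an LLM would give the agent setting; the harness itself stays unchanged.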

System Modules

Prompt Constructor

Integrates task description, hypothesis, and summarized results from previous rounds into a structured prompt

Model or implementation: Rule-based text formatting
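A rule-based prompt constructor of this kind might look like the sketch below. The wording of the prompt, the function name `build_prompt`, the record keys (`round`, `n_tested`, `hits`), and the truncation limit are illustrative assumptions; the paper's actual templates are not reproduced here.

```python
from typing import Dict, List


def build_prompt(
    task: str,
    hypothesis: str,
    history: List[Dict],
    batch_size: int,
    max_hits_shown: int = 20,
) -> str:
    """Format task, hypothesis, and summarized prior rounds as plain text."""
    lines = [f"Task: {task}", f"Hypothesis: {hypothesis}", ""]
    if not history:
        lines.append("No experimental results are available yet.")
    for record in history:
        hits = record["hits"]
        # Summarize long hit lists so the prompt stays within the context window.
        shown = ", ".join(hits[:max_hits_shown]) or "none"
        extra = len(hits) - max_hits_shown
        suffix = f" (+{extra} more)" if extra > 0 else ""
        lines.append(
            f"Round {record['round']}: tested {record['n_tested']} genes; "
            f"hits: {shown}{suffix}"
        )
    lines += ["", f"Propose {batch_size} genes to perturb next, one per line."]
    return "\n".join(lines)
```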

Tool Executor (Optional)

Queries external sources to augment the context before final prediction

Model or implementation: APIs (PubMed, Reactome) or Auxiliary LLM (Critic)

Gene Selector

Predicts the next batch of genes to perturb

Model or implementation: Claude 3.5 Sonnet (primary), Claude 3 Haiku, GPT-4o, etc.

Novel Architectural Elements

Integration of an 'AI Critic' agent that reviews and modifies the main agent's gene selection plan before finalization
Dynamic fallback mechanism: the agent first suggests genes in free form, but switches to selecting from a summarized 'remaining genes' list if the model hallucinates, repeats, or otherwise proposes invalid genes
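One way the fallback could be wired up is sketched below. The trigger rule (`min_valid_fraction`), the function name `select_with_fallback`, and the two callables standing in for LLM calls are assumptions made for illustration; the source does not specify when the switch happens.

```python
from typing import Callable, Iterable, List, Set


def select_with_fallback(
    propose_free_form: Callable[[int], List[str]],
    propose_from_list: Callable[[List[str], int], List[str]],
    valid_genes: Set[str],
    tested: Set[str],
    batch_size: int,
    min_valid_fraction: float = 0.5,  # assumed trigger, not from the paper
) -> List[str]:
    """Try free-form suggestions first; fall back to choosing from a list."""

    def clean(candidates: Iterable[str], exclude: Set[str]) -> List[str]:
        # Drop hallucinated names, already-tested genes, and duplicates.
        seen: Set[str] = set(exclude)
        kept: List[str] = []
        for gene in candidates:
            gene = gene.strip()
            if gene in valid_genes and gene not in tested and gene not in seen:
                seen.add(gene)
                kept.append(gene)
        return kept

    batch = clean(propose_free_form(batch_size), set())[:batch_size]
    if len(batch) >= min_valid_fraction * batch_size:
        return batch

    # Fallback: constrain the model to genes that are still available.
    remaining = sorted(valid_genes - tested - set(batch))
    extra = clean(propose_from_list(remaining, batch_size - len(batch)), set(batch))
    return (batch + extra)[:batch_size]
```

In a real agent, `remaining` would be summarized before being placed in the prompt, since the full list may not fit in the context window.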

Modeling

Base Model: Claude 3.5 Sonnet (best performing)

Compute: Inference only (no training). Costs reported: Claude 3.5 Sonnet ($15/1M output tokens), Claude 3 Haiku ($1.25/1M output tokens).
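As a rough illustration of what these prices imply, the sketch below converts output-token counts into dollars. Only the two per-million output prices come from the text above; the model keys and the token counts in the usage example are hypothetical, and input-token costs are not covered because they are not given here.

```python
# Output-token prices in USD per 1M tokens, as quoted above.
PRICE_PER_M_OUTPUT = {
    "claude-3.5-sonnet": 15.00,
    "claude-3-haiku": 1.25,
}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Cost of generated tokens only (input tokens are not included)."""
    return PRICE_PER_M_OUTPUT[model] * output_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 5 rounds at 2,000 output tokens per round.
    tokens = 5 * 2_000
    for model in PRICE_PER_M_OUTPUT:
        print(f"{model}: ${output_cost_usd(model, tokens):.4f}")
```

At these prices, output from Claude 3.5 Sonnet costs 12 times as much per token as output from Claude 3 Haiku.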

Comparison to Prior Work

vs. Bayesian Optimization: BioDiscoveryAgent uses natural language and biological prior knowledge rather than just numerical function approximation
vs. Coreset: BioDiscoveryAgent reasons about phenotypic mechanisms rather than just feature space diversity
vs. Random: BioDiscoveryAgent strategically navigates the hypothesis space using reasoning
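For contrast, a Bayesian-optimization-style selector ranks candidates with a numerical acquisition function. The sketch below uses upper confidence bound (UCB), a standard choice; the source does not say which acquisition function the baselines use, and the posterior means and standard deviations are assumed to come from some surrogate model that is not shown.

```python
from typing import Dict, List, Tuple


def ucb(mean: float, std: float, beta: float = 2.0) -> float:
    """Upper confidence bound: exploitation (mean) plus exploration (beta * std)."""
    return mean + beta * std


def select_batch_ucb(
    posterior: Dict[str, Tuple[float, float]],
    batch_size: int,
    beta: float = 2.0,
) -> List[str]:
    """Pick the genes with the highest UCB score.

    posterior maps gene name -> (predicted mean, predicted std) from a
    surrogate model (hypothetical here).
    """
    ranked = sorted(posterior, key=lambda g: ucb(*posterior[g], beta), reverse=True)
    return ranked[:batch_size]
```

With `beta=0` the selector is purely greedy on predicted effect; larger values favor uncertain genes. Nothing in this scoring uses gene names or literature, which is the gap the LLM-based approach targets.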

Limitations

Dependency on closed-source LLMs (Claude, GPT) creates cost and reproducibility issues
Performance varies significantly between different LLM families and sizes
Context window limits require summarization of large gene lists or experimental histories, potentially losing information
One evaluation dataset (CAR-T) is unpublished and unavailable for external verification

Reproducibility

Not provided: Code availability is not explicitly stated in the paper text or abstract. One dataset (CAR-T) is unpublished. Prompt templates are described only conceptually in the Appendix.

📊 Experiments & Results

Evaluation Setup

Simulated closed-loop genetic screens using retrospective data from 6 datasets (5 published, 1 unpublished)

Benchmarks:

Schmidt et al. (2022): T-cell cytokine production (IFNG, IL-2)
Carnevale et al. (2022): T-cell resistance to tumor microenvironment
CAR-T (unpublished): CAR-T cell proliferation [New]
Scharenberg et al. (2023): Lysosomal choline recycling
Sanchez et al. (2021): Tau protein levels in neurons
Horlbeck et al. (2018): 2-gene combinatorial perturbation (synergy)

Metrics:

Hit ratio (fraction of total true hits discovered)
Non-essential gene hit ratio
Statistical methodology: Not explicitly reported in the paper
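The two metrics can be computed as in the sketch below. The hit ratio follows the definition given in this summary (discovered true hits divided by all true hits). The non-essential variant is an assumption about how the restriction is applied, namely removing essential genes from the set of true hits before computing the same ratio; function names are illustrative.

```python
from typing import Iterable


def hit_ratio(tested: Iterable[str], true_hits: Iterable[str]) -> float:
    """Fraction of all true hits that appear among the genes tested so far."""
    hits = set(true_hits)
    if not hits:
        raise ValueError("true_hits must not be empty")
    return len(set(tested) & hits) / len(hits)


def non_essential_hit_ratio(
    tested: Iterable[str],
    true_hits: Iterable[str],
    essential_genes: Iterable[str],
) -> float:
    """Hit ratio restricted to true hits that are not essential genes."""
    non_essential_hits = set(true_hits) - set(essential_genes)
    return hit_ratio(tested, non_essential_hits)
```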

Key Results

Performance on 1-gene perturbation experiments across 6 datasets after 5 rounds (128 genes/round):

Benchmark	Metric	Baseline	This Paper	Δ
Average across 6 datasets	Hit Ratio Improvement (%)	Bayesian optimization (value not reported in the paper)	Not reported in the paper	+21%
Average across 6 datasets	Non-essential Hit Ratio Improvement (%)	Not reported in the paper	Not reported in the paper	+46%

Performance on the 2-gene combinatorial perturbation task (Horlbeck et al., 2018):

Benchmark	Metric	Baseline	This Paper	Δ
Horlbeck et al. (2018)	Hit Ratio Improvement (%)	Random (value not reported in the paper)	Not reported in the paper	+170%
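Because only relative deltas are reported, it can help to translate them into fold changes. The helper names below are illustrative; the sketch assumes the deltas are relative improvements in hit ratio, i.e. (new - baseline) / baseline.

```python
def relative_improvement_pct(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


def fold_change_from_pct(improvement_pct: float) -> float:
    """Convert a relative improvement in percent to a multiplicative factor."""
    return 1.0 + improvement_pct / 100.0


if __name__ == "__main__":
    for delta in (21, 46, 170):
        print(f"+{delta}% corresponds to {fold_change_from_pct(delta):.2f}x the baseline")
```

Under this reading, +170% means roughly 2.7 times the baseline hit ratio, not double.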

Main Takeaways

BioDiscoveryAgent (Claude 3.5 Sonnet) consistently outperforms both random and Bayesian optimization baselines across diverse biological datasets
The agent is particularly effective at the 'cold start' (early rounds), leveraging biological knowledge before sufficient data is collected for ML baselines
Tool use (literature search, critic, gene search) showed mixed results; for the strongest model (Claude 3.5 Sonnet), tools did not significantly boost performance, suggesting the model's internal knowledge is sufficient
The agent successfully generalizes to the much larger search space of combinatorial (2-gene) perturbations, where it improves on random baselines by 170% (roughly 2.7 times their hit ratio)

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs)
Familiarity with Bayesian Optimization and acquisition functions
Knowledge of CRISPR-based genetic screens and drug target discovery

Key Terms

genetic perturbation: Modifying a gene (e.g., repressing or activating it) to observe the resulting change in biological traits (phenotype)

phenotype: The observable physical properties of an organism or cell, such as cell growth or protein levels

hit: A gene that, when perturbed, produces a phenotypic response exceeding a specific threshold

hit ratio: The fraction of true hits discovered cumulatively over the course of the experiment relative to the total number of true hits

cold start problem: The difficulty of making good predictions at the beginning of an experiment when no data has yet been collected

essential genes: Genes critical for cell survival; perturbing them almost always causes a strong effect, making them 'easy' but often uninformative targets

Reactome: A curated database of biological pathways and processes used here to find genes with similar functions

Bayesian optimization: A strategy for global optimization of black-box functions that fits a surrogate model to observed data and uses it to choose the next experiments; used here as a baseline

acquisition function: In Bayesian optimization, a function that guides the search by balancing exploration (finding new info) and exploitation (using known info)

CRISPR: A technology used to selectively modify the DNA of living organisms, used here to perform the perturbations

Coreset: A specific active learning or batch selection strategy used as a baseline, focusing on selecting a diverse set of representative samples
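To illustrate the diversity-driven selection that the Coreset entry describes, here is a minimal k-center greedy sketch, one common formulation of coreset selection. Whether the baseline in the paper uses exactly this variant is an assumption, as are the function name and the per-gene feature vectors.

```python
import math
from typing import Dict, List, Sequence


def k_center_greedy(
    features: Dict[str, Sequence[float]],
    selected: List[str],
    k: int,
) -> List[str]:
    """Greedily pick k genes, each as far as possible from those already chosen."""

    def dist(a: str, b: str) -> float:
        return math.dist(features[a], features[b])  # Euclidean distance

    candidates = [g for g in features if g not in selected]
    # Distance from each candidate to its nearest already-selected gene.
    min_dist = {
        g: min((dist(g, s) for s in selected), default=math.inf)
        for g in candidates
    }
    chosen: List[str] = []
    for _ in range(min(k, len(candidates))):
        farthest = max(min_dist, key=min_dist.get)
        chosen.append(farthest)
        del min_dist[farthest]
        # The new pick may now be the nearest center for some candidates.
        for g in min_dist:
            min_dist[g] = min(min_dist[g], dist(g, farthest))
    return chosen
```

The selection depends only on geometry in feature space, with no notion of the phenotype being studied.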