The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation

📝 Paper Summary

Multi-agent scientific discovery · Protein design / Nanobody engineering

The Virtual Lab is an AI system in which a team of specialized LLM agents collaboratively designs, implements, and executes a computational pipeline that produces new nanobodies against recent SARS-CoV-2 variants.

Core Problem

Interdisciplinary research requires coordinating diverse experts (e.g., virologists, ML engineers), which is resource-intensive and slow, hindering rapid responses to evolving threats like SARS-CoV-2 variants.

Why it matters:

Rapidly evolving viruses develop resistance to existing therapies, creating an urgent need for fast, automated design of binders for new variants like KP.3
Most scientists lack immediate access to large, diverse teams of experts, limiting the complexity of research they can undertake alone
Current LLM tools (e.g., ChemCrow) handle narrow tasks but struggle with open-ended, multi-step research design requiring high-level reasoning across fields

Concrete Example: A human biologist wants to design nanobodies for a new variant but doesn't know how to code the latest ML models (like ESM or AlphaFold). Currently, they must hire a computational specialist or learn it themselves. The Virtual Lab automatically spawns a 'Computational Biologist' agent to write the code and an 'Immunologist' agent to guide the biological strategy.

Key Novelty

Virtual Lab: Collaborative AI-Human Research Framework

Simulates a research group where a Principal Investigator (PI) agent spawns specialized scientist agents (e.g., Immunologist, ML Specialist) based on the project description
Agents hold structured meetings (Team vs. Individual) to debate agendas, write code, and critique each other's work under human supervision
Integrates high-level reasoning (LLMs) with specialized computational tools (AlphaFold, Rosetta) to perform end-to-end research from ideation to execution
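The meeting structure described above can be sketched as a round-based loop. This is a minimal illustration, not the paper's implementation: the speaking order (PI, then scientists, then critic), the `respond` callable, and the `echo_responder` stand-in are all assumptions made for the example. A real system would replace `respond` with an LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (speaker_title, transcript_so_far) -> reply text.
# A real system would call an LLM here; this sketch accepts any callable.
Responder = Callable[[str, List[Tuple[str, str]]], str]


def run_team_meeting(
    agenda: str,
    pi: str,
    scientists: List[str],
    critic: str,
    respond: Responder,
    num_rounds: int = 2,
) -> List[Tuple[str, str]]:
    """Return the meeting transcript as a list of (speaker, text) pairs."""
    transcript: List[Tuple[str, str]] = [("Human", agenda)]
    for _ in range(num_rounds):
        # Assumed order: the PI opens, each scientist replies, the critic closes.
        for speaker in [pi, *scientists, critic]:
            transcript.append((speaker, respond(speaker, transcript)))
    # The PI ends the meeting by synthesizing the discussion.
    transcript.append((pi, respond(pi, transcript)))
    return transcript


def echo_responder(speaker: str, transcript: List[Tuple[str, str]]) -> str:
    """Stand-in for an LLM call: reports who speaks and how many turns came before."""
    return f"{speaker} responding after {len(transcript)} turns"


meeting = run_team_meeting(
    agenda="Choose tools for nanobody design",
    pi="Principal Investigator",
    scientists=["Immunologist", "Computational Biologist"],
    critic="Scientific Critic",
    respond=echo_responder,
    num_rounds=2,
)
for speaker, text in meeting:
    print(f"{speaker}: {text}")
```

Passing the responder in as a parameter keeps the loop testable without network access; an individual meeting would be the same loop with a single scientist.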

Evaluation Highlights

92 AI-designed nanobodies were experimentally synthesized; 90% expressed soluble protein, showing the designs were biologically viable
Two novel nanobodies showed improved binding to the recent JN.1 or KP.3 variants while retaining binding to the ancestral strain, validating the design pipeline
Human intervention was minimal: The human researcher wrote only ~1.3% of the text (1,596 words) while AI agents generated 98.7% (122,462 words) including all code
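The authorship shares follow directly from the two word counts reported above:

$$
\frac{1{,}596}{1{,}596 + 122{,}462} = \frac{1{,}596}{124{,}058} \approx 1.29\%, \qquad \frac{122{,}462}{124{,}058} \approx 98.71\%
$$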

Breakthrough Assessment

9/10

Demonstrates a complete loop from high-level AI research planning to wet-lab experimental validation with successful hits. Moves beyond 'LLMs writing code' to 'LLMs conducting science'.

⚙️ Technical Details

Problem Definition

Setting: Automated design of nanobody sequences $x'$ that bind to a target antigen (SARS-CoV-2 KP.3 RBD), obtained by mutating starting sequences $x$ into improved variants

Inputs: High-level natural language research goal (e.g., 'Design nanobodies for KP.3 variant') provided by human

Outputs: List of candidate nanobody sequences for experimental synthesis

Pipeline Flow

Team Selection (PI Agent creates Scientist Agents)
Project Specification (Team Meeting to define goals/strategy)
Tools Selection (Team Meeting to pick software)
Tools Implementation (Individual Meetings to write code for ESM, AlphaFold, Rosetta)
Workflow Design (PI Agent defines scoring/ranking logic)
Execution (Running the designed pipeline)

System Modules

Principal Investigator (PI)

Guides project, creates other agents, synthesizes discussions, makes final decisions

Model or implementation: GPT-4o

Scientist Agents

Domain experts (e.g., Immunologist, ML Specialist, Computational Biologist) that debate ideas and write code

Model or implementation: GPT-4o

Scientific Critic

Provides critical feedback to catch errors and bias in other agents' outputs

Model or implementation: GPT-4o

Parallel Meeting Runner

Runs the same meeting N times in parallel at high temperature, then merges the results in a single low-temperature pass

Model or implementation: GPT-4o

Novel Architectural Elements

Dynamic Agent Generation: PI agent defines the team composition (roles/prompts) on the fly based on the specific problem
Meeting-based Workflow: Explicit 'Team Meetings' (all agents) vs. 'Individual Meetings' (task execution) structure mimicking human labs
Parallel-then-Merge Consensus: Runs parallel conversation threads at high temperature (0.8) to boost creativity, then has a PI agent merge them at low temperature (0.2) to filter hallucinations
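A minimal sketch of the parallel-then-merge idea, assuming a generic `llm(prompt, temperature)` callable. The function names, the merge prompt wording, and the default of five parallel runs are hypothetical; only the two temperatures (0.8 and 0.2) come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical LLM interface: (prompt, temperature) -> generated text.
LLM = Callable[[str, float], str]


def parallel_then_merge(
    agenda: str,
    llm: LLM,
    n_parallel: int = 5,        # assumed default; the summary only says "N"
    creative_temp: float = 0.8,
    merge_temp: float = 0.2,
) -> str:
    """Run the same agenda n_parallel times, then merge the drafts in one pass."""
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        # Each worker runs the same agenda at the high, exploratory temperature.
        drafts: List[str] = list(
            pool.map(lambda _: llm(agenda, creative_temp), range(n_parallel))
        )
    merge_prompt = (
        "Merge the best components of these meeting summaries:\n"
        + "\n---\n".join(drafts)
    )
    # A single low-temperature call consolidates the drafts.
    return llm(merge_prompt, merge_temp)


def fake_llm(prompt: str, temperature: float) -> str:
    """Stand-in for a real model call; echoes the temperature it was given."""
    return f"answer generated at T={temperature}"


summary = parallel_then_merge("Pick a nanobody design strategy", fake_llm, n_parallel=3)
print(summary)
```

Threads suit this pattern because each run spends its time waiting on a remote model call rather than computing locally.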

Modeling

Base Model: GPT-4o (powering all agents)

Training Method: Prompt Engineering / In-context Learning (Agents are defined via system prompts)
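Because agents are defined purely through system prompts, an agent can be represented as a small record that renders to a prompt string. The sketch below is illustrative: the field names (`title`, `expertise`, `goal`, `role`), the prompt wording, and the example values are assumptions, not text taken from the paper's appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """An agent specified entirely by the text of its system prompt."""
    title: str
    expertise: str
    goal: str
    role: str

    def system_prompt(self) -> str:
        # Hypothetical prompt template for illustration.
        return (
            f"You are the {self.title} on this research team. "
            f"Your expertise is in {self.expertise}. "
            f"Your goal is to {self.goal}. "
            f"Your role is to {self.role}."
        )


immunologist = Agent(
    title="Immunologist",
    expertise="antibody engineering and immune responses to SARS-CoV-2",
    goal="guide the design of nanobodies that bind new variants",
    role="advise the team on biological strategy",
)
print(immunologist.system_prompt())
```

With this representation, dynamic agent generation amounts to the PI agent filling in these fields for each new team member.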

Compute: Not reported in the paper

Comparison to Prior Work

vs. ChemCrow/Coscientist: Virtual Lab performs higher-level research design (defining the team, choosing the tools) rather than just executing predefined tools
vs. AI Scientist: Virtual Lab integrates wet-lab validation and tackles a physical biology problem rather than just computational experiments
vs. Standard Protein Design (e.g., RFdiffusion) [not cited in paper]: Uses conversational agents to *write* the design pipeline code rather than being the design model itself

Limitations

Relies on proprietary LLMs (GPT-4o), limiting reproducibility and cost control
Agent-designed pipeline required minor bug fixes (handled by a follow-up 'fix' meeting), showing code generation isn't perfect
Resulting nanobodies showed only moderate affinity improvements; no 'super-binder' emerged from the designed candidates
Wet-lab validation is slow and decoupled from the fast computational loop

Reproducibility

Code availability is not explicitly provided in the paper text. Full prompts for agents and meetings are included in the Appendix. The paper relies on closed-source GPT-4o. Biological validation protocols are described.

📊 Experiments & Results

Evaluation Setup

Design of nanobodies binding to SARS-CoV-2 KP.3 RBD followed by wet-lab synthesis and binding assays

Benchmarks:

Experimental Validation (ELISA) (Protein Binding Assay) [New]

Metrics:

Soluble expression rate
Binding affinity (ELISA signal)
Specificity (binding to target vs. controls)
Statistical methodology: Not explicitly reported in the paper

Key Results

Experimental validation confirms the AI-designed nanobodies are structurally sound and functional.

Benchmark	Metric	Baseline	This Paper	Δ
Wet Lab Synthesis	Percentage Soluble/Expressed	Not reported in the paper	90	Not reported in the paper
ELISA (Wuhan RBD)	Retention of Binding	Not reported in the paper	Not reported in the paper	Not reported in the paper
ELISA (JN.1/KP.3 RBD)	Binding Signal	0.1	3.5	+3.4

Main Takeaways

Virtual Lab successfully coordinated a multi-step design process: creating agents, defining a strategy, writing code for 3 distinct tools, and executing the workflow.
The agent-designed scoring function (Weighted Score of ESM, AlphaFold, Rosetta) successfully identified stable, soluble proteins (90% success rate).
Two specific hits (from Nb21 and Ty1 lineages) gained binding to new variants (JN.1/KP.3) while preserving ancestral binding, a difficult multi-objective optimization.
The system is highly autonomous: the human researcher contributed only ~1.3% of the words written, mostly as high-level guidance.
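The weighted-score ranking mentioned in the takeaways can be sketched as follows. This is an assumption-laden illustration: the weights are illustrative values not given in this summary, the candidate names and metric values are made up, and the sign convention (higher ESM LLR and ipLDDT are better, lower Rosetta dG is better) is the standard reading of these metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    esm_llr: float     # higher = mutation judged more plausible by ESM
    af_iplddt: float   # higher = more confident predicted interface
    rosetta_dg: float  # lower (more negative) = stronger predicted binding


# Illustrative weights; not stated in this summary.
W_ESM, W_AF, W_RS = 0.2, 0.5, 0.3


def weighted_score(c: Candidate) -> float:
    """Combine the three metrics; dG is subtracted because lower is better."""
    return W_ESM * c.esm_llr + W_AF * c.af_iplddt - W_RS * c.rosetta_dg


def rank(candidates: List[Candidate], top_k: int) -> List[Candidate]:
    """Return the top_k candidates by descending weighted score."""
    return sorted(candidates, key=weighted_score, reverse=True)[:top_k]


# Made-up values for demonstration only.
candidates = [
    Candidate("mutant_A", esm_llr=1.0, af_iplddt=80.0, rosetta_dg=-40.0),
    Candidate("mutant_B", esm_llr=2.0, af_iplddt=70.0, rosetta_dg=-30.0),
    Candidate("mutant_C", esm_llr=-1.0, af_iplddt=85.0, rosetta_dg=-50.0),
]
best = rank(candidates, top_k=2)
print([c.name for c in best])
```

Note that the three metrics live on different scales, so in practice the weights also act as an implicit normalization; a real pipeline might standardize each metric first.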

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) agent frameworks
Familiarity with protein structure prediction and design
Basic knowledge of antibody/nanobody biology

Key Terms

Nanobody: A small, single-domain antibody fragment (derived from camelids) that is stable and easier to produce than full antibodies

RBD: Receptor Binding Domain—the part of the viral spike protein that binds to host cells; a key target for neutralizing antibodies

ESM: Evolutionary Scale Modeling—a protein language model used here to estimate the evolutionary likelihood of mutations (used as a proxy for fitness)

AlphaFold-Multimer: An AI system that predicts the 3D structure of protein complexes (e.g., nanobody bound to spike protein)

Rosetta: A biophysical software suite used to calculate binding energies (dG) of protein interfaces

LLR: Log-Likelihood Ratio—a score from ESM comparing the probability of a mutated sequence vs. the original sequence
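For a single point mutation at position $i$, one common way to write this score (an assumed form, since the summary does not give the exact formula) uses the model's per-position probabilities conditioned on the starting sequence $x$:

$$
\mathrm{LLR}(x \to x') = \log p_{\mathrm{ESM}}(x'_i \mid x) - \log p_{\mathrm{ESM}}(x_i \mid x)
$$

A positive value means the model considers the mutant residue $x'_i$ more likely than the original residue $x_i$ at that position.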

pLDDT: predicted Local Distance Difference Test—a confidence metric from AlphaFold; ipLDDT specifically measures confidence at the binding interface

ELISA: Enzyme-Linked Immunosorbent Assay—a wet-lab experiment to measure how well an antibody binds to its target antigen

SFT: Supervised Fine-Tuning (general concept, though this paper uses prompt engineering rather than SFT)

KP.3 / JN.1: Specific recent variants of the SARS-CoV-2 virus (Omicron sub-lineages)

Zero-shot: Using a model to perform a task without specific training examples (used here for ESM scoring)