Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

📝 Paper Summary

Automated code generation, Scientific discovery agents, Tool-use post-training

Paper2Agent is an automated framework that transforms static research papers and their codebases into interactive AI agents by building verified Model Context Protocol (MCP) servers.

Core Problem

Research papers are passive artifacts; reproducing computational methods requires substantial effort to locate code, install complex dependencies, and understand API hierarchies, creating barriers to adoption.

Why it matters:

Biologists and non-experts cannot easily leverage advanced computational tools (e.g., AlphaGenome) due to technical setup barriers
Static code repositories often require significant manual adaptation to work on new data
Existing 'executable papers' or notebooks still require technical familiarity to configure and run successfully

Concrete Example: To use AlphaGenome, a user must normally install environments, import modules, manage API keys, and construct specific chromosome objects. With Paper2Agent, a user simply asks 'Generate AlphaGenome predictions for these variants,' and the agent handles the underlying API complexity automatically.

Key Novelty

Automated Paper-to-MCP Conversion Pipeline

Systematically analyzes a paper's text and codebase using specialized agents (Environment, Extraction, Testing) to construct a Model Context Protocol (MCP) server
Validates generated tools against the paper's reported results to lock in reproducibility and prevent 'code hallucination' before exposing them to users
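
One narrow form of 'code hallucination', calling functions that do not exist, can be caught with a static check. The sketch below is an illustration only and is not described in the paper: `find_missing_calls` is a hypothetical helper, and it handles only direct `module.attr(...)` calls, not aliases or nested attributes.

```python
import ast
import importlib


def find_missing_calls(source: str, module_name: str) -> list[str]:
    """Return names called as `module_name.<attr>(...)` that the module lacks.

    Hypothetical helper for illustration, not part of Paper2Agent.
    """
    module = importlib.import_module(module_name)
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == module_name
            and not hasattr(module, func.attr)
        ):
            missing.append(func.attr)
    return sorted(set(missing))


# Generated snippet that mixes a real call with an invented one.
generated = "import math\nx = math.sqrt(4) + math.cube_root(8)\n"
missing = find_missing_calls(generated, "math")
print(missing)
```

A check like this only establishes that the called names exist. Whether the results are scientifically valid still requires comparing outputs against the paper's reported values.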

Architecture

The Paper2Agent workflow: identifying the codebase, invoking the construction agents, validating the tools via testing, and deploying them as an MCP server.

Evaluation Highlights

100.0% accuracy on novel AlphaGenome queries (unseen variants/tissues), outperforming Claude + Repo (80.0%) and Biomni (60.0%)
Median runtime reduced by 3.2x compared to Claude + Repo and 4.6x compared to Biomni on novel benchmark tasks
Automatically reproduced original paper results for TISSUE (prediction intervals) and Scanpy (clustering workflows) without human intervention

Breakthrough Assessment

9/10

Introduces a shift from static PDFs to interactive agents built on the Model Context Protocol (MCP). Demonstrates high reliability (100% accuracy on the AlphaGenome benchmark queries) and lets paper-derived agents work alongside other agents on analysis tasks.

⚙️ Technical Details

Problem Definition

Setting: Automated conversion of unstructured scientific artifacts (PDF + Codebase) into structured, executable agent interfaces

Inputs: Research paper PDF and associated code repository

Outputs: A verified Model Context Protocol (MCP) server containing executable tools, resources, and prompts
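
To make the three output component types concrete, here is a minimal sketch using only the Python standard library. It is a toy stand-in, not the MCP SDK and not Paper2Agent's generated code: the `PaperServer` class, the resource URI, the prompt text, and the body of `score_variant_effect` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PaperServer:
    """Toy stand-in for an MCP server: three registries keyed by name."""

    tools: dict[str, Callable[..., object]] = field(default_factory=dict)
    resources: dict[str, str] = field(default_factory=dict)
    prompts: dict[str, str] = field(default_factory=dict)

    def tool(self, func):
        """Decorator that registers a function as a callable tool."""
        self.tools[func.__name__] = func
        return func

    def call_tool(self, name: str, **arguments):
        """Look up a registered tool by name and run it."""
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](**arguments)


server = PaperServer()


@server.tool
def score_variant_effect(ref_score: float, alt_score: float) -> float:
    """Placeholder logic: difference between two hypothetical model scores."""
    return alt_score - ref_score


# Resources are static assets; prompts are reusable workflow templates.
server.resources["paper://abstract"] = "Manuscript text would go here."
server.prompts["variant_workflow"] = "Score each variant in {variants}, then rank them."

result = server.call_tool("score_variant_effect", ref_score=0.5, alt_score=2.0)
print(result)
```

Keeping tools, resources, and prompts in separate registries mirrors the split named above: executable functions, static assets, and workflow templates are looked up by name but used differently by a downstream agent.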

Pipeline Flow

Codebase Identification
Group: Construction (Environment Agent → Extraction Agent)
Group: Validation (Testing Agent)
Deployment (MCP Server)

System Modules

Environment Agent (Construction)

Configures the necessary software environment and dependencies for the paper's code

Model or implementation: Claude Code

Extraction Agent (Construction)

Translates core methods from the paper into implemented MCP tools

Model or implementation: Claude Code

Testing Agent

Iteratively generates and runs tests to refine tools until they match reference results

Model or implementation: Claude Code

MCP Server

Hosts the validated tools, resources, and prompts as a standardized interface for downstream agents

Model or implementation: N/A (Software Artifact)

Novel Architectural Elements

Automated validation loop where a Testing Agent checks extracted tools against the paper's reported numbers/figures before deployment
Transformation of static paper artifacts into standardized MCP (Model Context Protocol) components (Tools, Resources, Prompts)
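
The validation loop can be pictured as trying successive revisions of a tool until one reproduces a reference value. This is a simplified sketch under stated assumptions: the sequence of candidate functions stands in for the Testing Agent's iterative refinements, `refine_until_match` is a hypothetical name, and the mean-of-a-list example is invented for illustration.

```python
import math
from typing import Callable, Optional, Sequence


def refine_until_match(
    candidates: Sequence[Callable[[list[float]], float]],
    inputs: list[float],
    reference: float,
    rel_tol: float = 1e-9,
) -> Optional[tuple[Callable[[list[float]], float], int]]:
    """Return the first candidate whose output matches the reference,
    together with the 1-based attempt number, or None if none matches."""
    for attempt, tool in enumerate(candidates, start=1):
        try:
            output = tool(inputs)
        except Exception:
            continue  # a crashing revision counts as a failed attempt
        if math.isclose(output, reference, rel_tol=rel_tol):
            return tool, attempt
    return None


def mean_buggy(values: list[float]) -> float:
    return sum(values) / (len(values) - 1)  # wrong denominator


def mean_fixed(values: list[float]) -> float:
    return sum(values) / len(values)


accepted = refine_until_match([mean_buggy, mean_fixed], [2.0, 4.0, 6.0], reference=4.0)
print(accepted[1] if accepted else "no candidate matched")
```

Returning `None` when no revision matches gives the caller an explicit signal to withhold the tool rather than deploy an unverified version.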

Modeling

Base Model: Claude Code (used as the underlying engine for the Paper2Agent construction process)

Compute: AlphaGenome MCP tools generated in ~3 hours on a personal laptop. Scanpy tools generated in ~45 minutes.

Comparison to Prior Work

vs. Biomni: Paper2Agent creates specialized agents verified against specific papers, achieving higher accuracy (100% vs 40-60%)
vs. Executable Papers: Paper2Agent enables natural language interaction and autonomous execution rather than just manual code running
vs. Virtual Lab: Focuses specifically on converting existing literature into agents, rather than on general problem solving

Limitations

Relies on the availability and quality of the original paper's codebase
Construction time varies by complexity (about 3 hours for AlphaGenome vs. about 45 minutes for Scanpy)
Depends on the capabilities of the underlying LLM (Claude Code) for code comprehension

Reproducibility

Paper2Agent validates tools against original codebases (AlphaGenome, TISSUE, Scanpy) during construction. Code references are embedded in tools for traceability. The framework creates reproducible artifacts by 'locking' valid code.
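
The summary does not say how 'locking' is implemented. One plausible reading, sketched here purely as an assumption, is to store a content hash of the validated source alongside a reference to the original code, so later changes can be detected. `LockedTool`, `lock_tool`, `is_unmodified`, and the reference string are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedTool:
    name: str
    source: str
    code_reference: str  # hypothetical pointer into the original repository
    sha256: str


def _digest(source: str) -> str:
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def lock_tool(name: str, source: str, code_reference: str) -> LockedTool:
    """Record the validated source together with its hash and provenance."""
    return LockedTool(name, source, code_reference, _digest(source))


def is_unmodified(tool: LockedTool, current_source: str) -> bool:
    """True if the current source is byte-identical to the locked version."""
    return _digest(current_source) == tool.sha256


validated_source = "def add(a, b):\n    return a + b\n"
locked = lock_tool("add", validated_source, "hypothetical/path.py, lines 10-12")
print(is_unmodified(locked, validated_source))
```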

📊 Experiments & Results

Evaluation Setup

Validation of generated agents on reproduction tasks and novel queries using established scientific methods

Benchmarks:

AlphaGenome Benchmark (genomic variant interpretation; tutorial and novel queries) [New]
TISSUE Benchmark (Spatial transcriptomics prediction interval construction) [New]
Scanpy Benchmark (Single-cell preprocessing and clustering) [New]

Metrics:

Accuracy (correct execution and result)
Runtime efficiency
Statistical methodology: Not explicitly reported in the paper

Key Results

AlphaGenome agent performance comparison on manual tutorial queries and novel, unseen queries.
Benchmark	Metric	Baseline	This Paper	Δ
AlphaGenome Benchmark (Tutorial Queries)	Accuracy	60.0%	100.0%	+40.0%
AlphaGenome Benchmark (Novel Queries)	Accuracy	80.0%	100.0%	+20.0%
AlphaGenome Benchmark (Novel Queries)	Runtime Improvement (vs Claude + Repo)	1.0x (Reference)	3.2x faster	3.2x speedup
AlphaGenome Benchmark (Tutorial Queries)	Runtime Improvement (vs Claude + Repo)	1.0x (Reference)	1.8x faster	1.8x speedup
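
The Δ column for the accuracy rows is a difference in percentage points, which can be checked directly from the table values. The speedup helper below assumes that "median runtime reduced by N x" means the ratio of median runtimes, which the summary does not state; the runtimes passed to it are made-up numbers, not measurements from the paper.

```python
from statistics import median

# (baseline %, this paper %) taken from the accuracy rows of the table.
accuracy_rows = {
    "Tutorial Queries": (60.0, 100.0),
    "Novel Queries": (80.0, 100.0),
}
deltas = {name: this - base for name, (base, this) in accuracy_rows.items()}
print(deltas)


def median_speedup(baseline_runtimes: list[float], agent_runtimes: list[float]) -> float:
    """Ratio of median runtimes (assumed definition of the reported speedup)."""
    return median(baseline_runtimes) / median(agent_runtimes)


# Made-up runtimes in seconds, for illustration only.
example_speedup = median_speedup([100.0, 120.0, 200.0], [50.0, 60.0, 90.0])
print(example_speedup)
```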

Experiment Figures

AlphaGenome case study results, including the suite of tools generated, accuracy benchmarks, and runtime comparisons.

Main Takeaways

Paper2Agent achieves 100% accuracy on AlphaGenome queries, compared with 60.0% (tutorial) and 80.0% (novel) for the reported baseline, owing to its validation-locked tools
Significant efficiency gains (up to 4.6x speedup) because the agent uses pre-built tools rather than writing code from scratch for every query
Demonstrated ability to facilitate novel scientific discovery: The agent correctly prioritized causal variants (rs1626703) in ADHD GWAS data by collaborating with a data agent

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based agents and tool use
Familiarity with scientific Python ecosystems (genomics, single-cell analysis)
Basic knowledge of API protocols

Key Terms

MCP: Model Context Protocol—a standard that enables AI agents to connect to external data and tools via a unified interface

AlphaGenome: A genome-scale foundation model used as a primary case study for predicting the impact of genetic variants

GWAS: Genome-Wide Association Study—an observational study of a genome-wide set of genetic variants to see if any variant is associated with a trait

eQTL: Expression Quantitative Trait Loci—genomic loci that explain variation in expression levels of mRNAs

TISSUE: A method for uncertainty-aware single-cell spatial transcriptomics analysis used as a case study

Scanpy: A comprehensive toolkit for analyzing single-cell gene expression data

code hallucination: When an LLM generates code that looks syntactically correct but calls non-existent functions or produces scientifically invalid results

MCP Tools: Executable functions encapsulating a paper's methods (e.g., score_variant_effect)

MCP Resources: Static assets like datasets, figures, or manuscript text exposed via the protocol

MCP Prompts: Pre-defined templates that guide agents through complex multi-step workflows (e.g., standard preprocessing pipelines)
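
MCP messages are JSON-RPC 2.0, and a client invokes a tool with a `tools/call` request that names the tool and passes its arguments. The sketch below builds such a request with the standard library only; the `variant` argument and its value are hypothetical, since the summary does not give the real parameters of `score_variant_effect`.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def build_tool_call(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# The argument name and value are placeholders, not the tool's real schema.
message = build_tool_call("score_variant_effect", {"variant": "chr1:12345:A>G"})
print(message)
```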