Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents

📝 Paper Summary

Self-evolving, Agentic reasoning, Simulation-based learning, Medical AI

Agent Hospital creates a virtual world where doctor agents autonomously evolve medical expertise by treating thousands of synthesized patients, achieving state-of-the-art performance on MedQA without manual labeling.

Core Problem

Medical LLMs acquire knowledge from texts (Phase 1) but lack the clinical experience gained through practice (Phase 2), and real-world trial-and-error for AI is risky and slow.

Why it matters:

Current medical agents rely on static prompting or fine-tuning, failing to model the continuous learning process of human doctors during residency
Annotating expert medical data for supervised learning is expensive and labor-intensive
Directly deploying agents in real hospitals for learning purposes is ethically and practically unfeasible

Concrete Example: A doctor agent relying only on base LLM knowledge might misdiagnose a patient with 'Herpes Zoster' because it lacks specific case experience. In Agent Hospital, after misdiagnosing a similar patient and receiving feedback, the agent reflects on the error, writes a rule (e.g., 'patients >50 are more susceptible'), and retrieves that rule when diagnosing later patients.

Key Novelty

Simulacrum-based Evolutionary Agent Learning (SEAL)

Constructs a complete hospital simulation where patients, nurses, and doctors are all agents, generating effectively unlimited interaction data without human labeling
Enables 'MedAgent-Zero' evolution where doctor agents accumulate a 'Medical Case Base' (successes) and 'Experience Base' (reflections on failures) to improve inference at runtime, rather than updating model weights

Architecture

The conceptual layout of Agent Hospital, showing functional areas (Triage, Consultation, etc.) and the interactions between Patient, Nurse, and Doctor agents.

Evaluation Highlights

Doctor agents improved diagnostic accuracy for Rheumatic Heart Disease from 9% (GPT-3.5 base) to 82% after evolution in the simulacrum
Evolved agents using GPT-4o outperformed state-of-the-art baselines (MedAgents, Medprompt) on the MedQA benchmark without seeing MedQA training data
Demonstrated scaling laws in evolution: accuracy improves logarithmically with the number of simulated patients treated (tested up to 20,000 cases)

Breakthrough Assessment

9/10

Proposes a significant paradigm shift from 'learning from data' to 'learning from simulated practice,' successfully demonstrating Sim-to-Real transfer in a complex cognitive domain (medicine) without manual supervision.

⚙️ Technical Details

Problem Definition

Setting: Task-specific medical decision making (Examination Selection, Diagnosis, Treatment) in a virtual environment with Sim-to-Real transfer

Inputs: Patient agent profile, medical history, and reported symptoms

Outputs: Medical examination requests, diagnosis results, and treatment plans

Pipeline Flow

Patient Generation (LLM + Knowledge Base)
Clinical Interaction (Triage -> Consultation -> Exam)
Doctor Reasoning (Retrieval -> Diagnosis)
Evolutionary Update (Feedback -> Case/Rule Storage)
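
The four stages above can be read as a loop over patients. The following is a minimal sketch of one pass through that loop, not the paper's implementation: `run_episode`, `stub_diagnose`, and `stub_reflect` are hypothetical names, the stubs stand in for LLM calls, and the two memories are plain lists.

```python
def run_episode(patient, diagnose, reflect, case_base, experience_base) -> bool:
    """Run one patient through diagnosis and the evolutionary update."""
    # The ground-truth disease is hidden from the doctor agent.
    visible = {k: v for k, v in patient.items() if k != "disease"}
    predicted = diagnose(visible)
    correct = predicted == patient["disease"]
    if correct:
        # Success: keep the record as a future few-shot example.
        case_base.append({**visible, "diagnosis": predicted})
    else:
        # Failure: keep a reflection (a text rule) instead of the record.
        experience_base.append(reflect(visible, predicted, patient["disease"]))
    return correct


# Hypothetical stand-ins for the LLM-backed doctor and its reflection step.
def stub_diagnose(visible):
    return "influenza" if "fever" in visible["symptoms"] else "asthma"


def stub_reflect(visible, predicted, actual):
    return f"Symptoms {visible['symptoms']} looked like {predicted} but were {actual}."


patients = [
    {"symptoms": ["fever", "cough"], "disease": "influenza"},
    {"symptoms": ["painful rash"], "disease": "herpes zoster"},
]
case_base, experience_base = [], []
results = [
    run_episode(p, stub_diagnose, stub_reflect, case_base, experience_base)
    for p in patients
]
```

In this toy run the first patient lands in the case base and the second produces a reflection, which mirrors the Feedback -> Case/Rule Storage step.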

System Modules

Patient Generator

Synthesize patient profiles, history, and symptoms based on disease probability distributions

Model or implementation: LLM (e.g., GPT-3.5/4) + Medical Knowledge Base
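
As a rough illustration of sampling from disease probability distributions, the sketch below draws a disease by weight and attaches symptoms from a lookup table. The priors, symptom lists, age range, and the `PatientProfile` and `generate_patient` names are all invented for the example; the actual system would use an LLM and a medical knowledge base for these steps.

```python
import random
from dataclasses import dataclass

# Hypothetical priors and symptoms, for illustration only.
DISEASE_PRIORS = {"herpes zoster": 0.2, "influenza": 0.5, "asthma": 0.3}
SYMPTOMS = {
    "herpes zoster": ["painful rash", "burning sensation"],
    "influenza": ["fever", "cough", "muscle aches"],
    "asthma": ["wheezing", "shortness of breath"],
}


@dataclass
class PatientProfile:
    patient_id: int
    age: int
    disease: str  # ground truth, hidden from the doctor agent
    symptoms: list


def generate_patient(patient_id: int, rng: random.Random) -> PatientProfile:
    diseases = list(DISEASE_PRIORS)
    weights = [DISEASE_PRIORS[d] for d in diseases]
    # Weighted draw: more probable diseases appear more often.
    disease = rng.choices(diseases, weights=weights, k=1)[0]
    age = rng.randint(18, 90)  # assumed adult age range
    return PatientProfile(patient_id, age, disease, list(SYMPTOMS[disease]))


rng = random.Random(0)  # seeded so the cohort is reproducible
patients = [generate_patient(i, rng) for i in range(1000)]
```

Seeding the generator keeps a synthetic cohort reproducible, which matters when comparing agents evolved on the same patients.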

Medical Case Base (Memory & Evolution)

Store successful treatment records to serve as few-shot examples for future cases

Model or implementation: Vector Database (implied for retrieval)
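
The summary only implies a vector database, so the sketch below substitutes a toy bag-of-words similarity to show the retrieval idea: store successful cases, then fetch the most similar ones as few-shot examples. `CaseBase`, `embed`, and `cosine` are hypothetical names, and a real system would presumably use learned embeddings instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CaseBase:
    def __init__(self):
        self._cases = []  # (vector, presentation, diagnosis)

    def add(self, presentation: str, diagnosis: str) -> None:
        self._cases.append((embed(presentation), presentation, diagnosis))

    def retrieve(self, query: str, k: int = 2):
        """Return the k stored cases most similar to the query."""
        q = embed(query)
        ranked = sorted(self._cases, key=lambda c: cosine(q, c[0]), reverse=True)
        return [(p, d) for _, p, d in ranked[:k]]


case_base = CaseBase()
case_base.add("painful rash on one side of the torso with burning sensation", "herpes zoster")
case_base.add("fever, cough and muscle aches for three days", "influenza")
case_base.add("wheezing and shortness of breath at night", "asthma")

top = case_base.retrieve("burning painful rash on the torso", k=1)
```

Here the rash query shares words only with the first stored case, so that case is returned as the few-shot example.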

Experience Base (Memory & Evolution)

Store tuning-free rules derived from reflection on past failures

Model or implementation: Text-based Rule Store
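
One way to picture a text-based rule store is a list of rules tagged with the symptoms they apply to. The sketch below is an assumption about the data structure, not the paper's design: `ExperienceBase`, its keyword tagging, and the duplicate check are hypothetical, and the example rule paraphrases the Herpes Zoster rule mentioned earlier.

```python
class ExperienceBase:
    """Plain-text rule store; no model weights are touched."""

    def __init__(self):
        self._rules = []  # (keyword set, rule text)

    @staticmethod
    def _normalize(text: str) -> str:
        # Case- and whitespace-insensitive form used for duplicate checks.
        return " ".join(text.lower().split())

    def add(self, keywords, rule_text: str) -> bool:
        """Store a rule unless an equivalent one exists; return True if stored."""
        norm = self._normalize(rule_text)
        if any(self._normalize(r) == norm for _, r in self._rules):
            return False
        self._rules.append((frozenset(k.lower() for k in keywords), rule_text))
        return True

    def lookup(self, symptoms):
        """Return rules whose keywords overlap the reported symptoms."""
        present = {s.lower() for s in symptoms}
        return [rule for keys, rule in self._rules if keys & present]


experience = ExperienceBase()
rule = "Patients over 50 with a painful rash are more susceptible to herpes zoster."
first = experience.add({"painful rash"}, rule)
duplicate = experience.add({"painful rash"}, "  patients over 50 with a painful rash are more susceptible to herpes zoster. ")
matches = experience.lookup(["Painful rash", "fever"])
```

Keeping rules as text is what makes the approach tuning-free: the rules are simply placed in the prompt at inference time.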

Doctor Agent

Perform diagnosis by combining patient info with retrieved cases and rules

Model or implementation: Proprietary LLM (GPT-3.5 or GPT-4o)
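
Combining patient information with retrieved cases and rules amounts to assembling a prompt for the frozen LLM. The sketch below shows one plausible layout. The prompt wording and the `build_doctor_prompt` name are hypothetical, since the paper's actual prompts are not in the main text, and no LLM call is made here.

```python
def build_doctor_prompt(patient_info: str, cases, rules) -> str:
    """Assemble a diagnosis prompt from patient info plus retrieved memory."""
    lines = ["You are a doctor agent. Diagnose the patient below.", ""]
    if cases:
        # Successful past cases act as few-shot examples.
        lines.append("Similar past cases:")
        lines.extend(f"- {presentation} -> {diagnosis}" for presentation, diagnosis in cases)
        lines.append("")
    if rules:
        # Rules come from reflection on earlier misdiagnoses.
        lines.append("Rules learned from past mistakes:")
        lines.extend(f"- {rule}" for rule in rules)
        lines.append("")
    lines.append(f"Patient: {patient_info}")
    lines.append("Diagnosis:")
    return "\n".join(lines)


prompt = build_doctor_prompt(
    "62-year-old with a painful one-sided rash",
    cases=[("painful rash with burning sensation", "herpes zoster")],
    rules=["Patients over 50 are more susceptible to herpes zoster."],
)
bare_prompt = build_doctor_prompt("30-year-old with a cough", cases=[], rules=[])
```

With empty memories the prompt reduces to the base-LLM setting, so the same function covers both the un-evolved and evolved agent.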

Novel Architectural Elements

Dual-memory evolutionary mechanism: Separating 'successful examples' (Case Base) from 'corrected failures' (Experience Base) to guide frozen LLMs
Closed-loop simulacrum: Treating the simulation not just as an environment but as a synthetic data generator that actively verifies diagnoses and feeds the results back into the agent's memory

Modeling

Base Model: GPT-3.5 (for evolution experiments) and GPT-4o (for MedQA benchmarks)

Training Method: In-context learning via dynamic memory population (Evolutionary)

Training Data:

20,000 synthetic patient agents per clinical department
Test set of 200 patient agents per department

Key Hyperparameters:

patient_agents_per_department: 20000
simulated_diseases: 339

Compute: Not reported in the paper

Comparison to Prior Work

vs. MedAgents: MedAgents relies on collaboration between static roles; Agent Hospital enables agents to evolve individually through historical practice
vs. Medprompt: Medprompt optimizes static prompts; Agent Hospital dynamically builds a retrieval base of experience from simulation
vs. Reflexion [not cited in paper]: Reflexion uses short-term memory of failures for a single task instance; Agent Hospital builds a persistent long-term memory (Experience Base) over thousands of distinct patients

Limitations

Dependency on the quality of the underlying LLM and Knowledge Base for creating realistic patient simulations
The simulation assumes medical professionals (agents) never get sick, simplifying the environment
Use of proprietary LLMs (GPT family) limits accessibility and reproducibility compared to open weights
Real-world alignment results are preliminary and rely on correlations with MedQA rather than clinical trials

Reproducibility

Code availability is not explicitly provided in the text. The paper uses proprietary models (GPT-3.5, GPT-4o) as the core 'brain', which may affect exact reproducibility due to API changes. Detailed prompts for patient generation and doctor reasoning are not in the main text.

📊 Experiments & Results

Evaluation Setup

Agents treat patients in a virtual hospital across 32 departments; skills are then tested on static benchmarks.

Benchmarks:

Virtual Hospital Tasks (Diagnosis, Examination Selection, Treatment Recommendation) [New]
MedQA (USMLE) (Medical Question Answering)

Metrics:

Diagnostic Accuracy
Statistical methodology: Not explicitly reported in the paper
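
For reference, diagnostic accuracy can be computed as the share of test patients whose predicted diagnosis matches the ground truth. The sketch below assumes case-insensitive exact string matching, which is an assumption: the summary does not state how the paper decides that a diagnosis is correct. The function name is hypothetical.

```python
def diagnostic_accuracy(predictions, ground_truth) -> float:
    """Percentage of predictions that match the ground-truth diagnosis."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        raise ValueError("at least one patient is required")
    # Assumed matching rule: exact match after trimming and lowercasing.
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return 100.0 * hits / len(ground_truth)


accuracy = diagnostic_accuracy(
    ["Asthma", "influenza", "asthma", "asthma"],
    ["asthma", "influenza", "herpes zoster", "asthma"],
)
```

In this toy example three of four predictions match, giving 75.0.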

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Virtual Hospital (Cardiology)	Diagnostic Accuracy	9	82	+73
Virtual Hospital (Respiratory)	Diagnostic Accuracy	66	95	+29
MedQA	Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Evolutionary performance curves. (a) Accuracy improvements for specific diseases. (b) Accuracy vs. Number of Patients treated (Scaling Law). (c) MedQA performance comparison.

Main Takeaways

Doctor agents exhibit 'scaling laws' in evolution: diagnostic accuracy increases roughly logarithmically with the number of patients treated, with gains flattening around 10,000-20,000 cases.
The 'Experience Base' (learning from failure) and 'Medical Case Base' (learning from success) allow frozen LLMs to improve significantly without weight updates.
Skills acquired in the virtual simulacrum transfer to the real-world MedQA benchmark, achieving SOTA results without direct training on the benchmark data.
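
A logarithmic trend of the kind described in the first takeaway can be checked by fitting accuracy = a + b * ln(n) with ordinary least squares. The sketch below does this in plain Python. The data points are synthetic, generated from a known curve purely to show that the fit recovers its parameters. They are not results from the paper, and `fit_log_curve` is a hypothetical helper.

```python
import math


def fit_log_curve(n_values, acc_values):
    """Least-squares fit of acc = a + b * ln(n); returns (a, b)."""
    xs = [math.log(n) for n in n_values]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(acc_values) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, acc_values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic illustration only: points drawn from a known curve.
true_a, true_b = 10.0, 8.0
ns = [100, 500, 1000, 5000, 10000, 20000]
accs = [true_a + true_b * math.log(n) for n in ns]

a, b = fit_log_curve(ns, accs)
```

On real accuracy curves, a pure logarithm keeps growing without bound, so a flattening trend near the largest patient counts would show up as residuals that turn negative at the right end of the fit.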

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based agents
Basic medical diagnostic workflows (triage -> exam -> diagnosis)
Concept of RAG (Retrieval-Augmented Generation)

Key Terms

SEAL: Simulacrum-based Evolutionary Agent Learning—the framework of using a simulated world to generate data and train agents through evolutionary practice

MedAgent-Zero: The specific evolutionary strategy where agents learn entirely from synthetic simulation data without human-labeled examples

Simulacrum: A virtual representation of a real-world environment (here, a hospital) where agents interact and simulate processes

MedQA: A benchmark dataset comprising US Medical Licensing Examination (USMLE) questions, used to test real-world medical knowledge

RAG: Retrieval-Augmented Generation—fetching relevant information (cases/rules) to prompt the LLM during inference

USMLE: United States Medical Licensing Examination—a standardized test for medical licensure

Zero-shot: Attempting a task without providing any specific training examples to the model beforehand