AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

📝 Paper Summary

Multi-agent simulation, Medical AI evaluation

AgentClinic evaluates medical AI agents not via static questions, but through interactive simulations with patient, moderator, and measurement agents to assess sequential decision-making, bias, and patient compliance.

Core Problem

Existing clinical benchmarks rely on static question-answering (e.g., USMLE), which fails to capture the complex, sequential, and dialogue-driven nature of real-world clinical decision-making.

Why it matters:

Static benchmarks overstate model capabilities: LLMs that score 90%+ on USMLE-style questions perform far worse when they must gather information sequentially.
Clinical work requires handling uncertainty, limited resources, and compassionate patient interaction, which multiple-choice exams cannot measure.
Current evaluations do not assess how cognitive and implicit biases affect diagnostic accuracy or patient compliance in interactive settings.

Concrete Example: A model might correctly answer a static MedQA multiple-choice question about a disease. In AgentClinic, where it must ask the patient about symptoms and order tests (such as a blood pressure reading) to obtain the same information, its accuracy drops from ~80% (static) to <20% (interactive) because it gathers information poorly.

Key Novelty

Interactive Multi-Agent Clinical Environment (AgentClinic)

Simulates a full clinical encounter using four distinct agents: Doctor (the model being evaluated), Patient (simulated case with history/personality), Measurement (returns test results), and Moderator (manages protocol).
Evaluates beyond accuracy by measuring 'soft' metrics like patient compliance, confidence, and consultation ratings based on the interaction.
Introduces a mechanism to systematically inject 24 types of cognitive and implicit biases (e.g., recency bias, gender bias) into agent prompts to study their impact on care.
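
The bias-injection idea can be pictured as appending a persona instruction to an agent's system prompt. The following is a minimal sketch under stated assumptions: `BIAS_PROMPTS`, `build_system_prompt`, and the instruction wording are hypothetical and only illustrate the mechanism; they are not the benchmark's actual prompts.

```python
from typing import Optional

# Hypothetical bias instructions. The benchmark defines 24 bias types;
# its exact prompt wording is not reproduced here.
BIAS_PROMPTS = {
    "recency": "You recently saw a patient with similar symptoms, "
               "and that case weighs on your judgment.",
    "gender": "You are less trusting of the other party because of their gender.",
}


def build_system_prompt(base_prompt: str, bias: Optional[str] = None) -> str:
    """Return the agent's system prompt, optionally extended with a bias instruction."""
    if bias is None:
        return base_prompt
    if bias not in BIAS_PROMPTS:
        raise KeyError(f"unknown bias type: {bias}")
    # The bias is injected as extra persona text, not as a change to the case data.
    return f"{base_prompt}\n\n{BIAS_PROMPTS[bias]}"


doctor_prompt = build_system_prompt("You are a doctor diagnosing a patient.", bias="recency")
```

Keeping the bias text separate from the base prompt makes it easy to run the same case with and without a given bias and compare the outcomes.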

Architecture

The interactive loop between the four agents in AgentClinic.

Evaluation Highlights

Diagnostic accuracy for Llama-3-70B drops from relatively high static performance to 19% in the interactive AgentClinic-MedQA setting.
Claude-3.5 Sonnet achieves the highest interactive diagnostic accuracy (62.1%), outperforming GPT-4 (51.6%) and human physicians (54%).
Using a 'Notebook' tool allows Llama-3 to achieve up to 92% relative improvement in accuracy by persisting notes across cases.

Breakthrough Assessment

9/10

A significant leap from static benchmarks to interactive, agent-based clinical simulation. It reveals massive gaps in current SOTA models' real-world utility and introduces novel patient-centric metrics.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making in a simulated clinical environment (OSCE - Objective Structured Clinical Examination)

Inputs: Initial patient presentation (chief complaint) and subsequent dialogue/measurement observations

Outputs: Final diagnosis and patient satisfaction ratings

Pipeline Flow

Moderator Agent initializes the case
Doctor Agent interacts with Patient Agent (dialogue) or Measurement Agent (tools)
Patient/Measurement Agents respond based on private case data
Doctor Agent submits final diagnosis
Moderator Agent evaluates diagnosis; Patient Agent evaluates satisfaction
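
The pipeline steps above can be sketched as a simple turn-taking loop. This is an illustration only: the `Case` fields, `run_encounter`, `scripted_doctor`, and the `max_turns` cap are assumptions made for the example, the agents are scripted stubs rather than LLM calls, and exact string comparison stands in for the Moderator's judgement. The Patient Agent's satisfaction rating is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]    # (role, text)
Action = Tuple[str, str]  # ("ask" | "test" | "diagnose", content)


@dataclass
class Case:
    # Private case data; field names are illustrative, not the benchmark's schema.
    chief_complaint: str
    history: Dict[str, str]
    test_results: Dict[str, str]
    correct_diagnosis: str


def run_encounter(case: Case,
                  doctor: Callable[[List[Turn]], Action],
                  max_turns: int = 20) -> Tuple[bool, List[Turn]]:
    """Run one simulated encounter and return (diagnosis_correct, transcript)."""
    # Step 1: the moderator initializes the case with the chief complaint.
    transcript: List[Turn] = [("patient", case.chief_complaint)]
    for _ in range(max_turns):
        # Step 2: the doctor chooses to talk to the patient, order a test, or diagnose.
        kind, content = doctor(transcript)
        transcript.append(("doctor", f"{kind}: {content}"))
        if kind == "diagnose":
            # Steps 4-5: the moderator checks the final diagnosis.
            correct = content.strip().lower() == case.correct_diagnosis.lower()
            return correct, transcript
        # Step 3: patient or measurement agent answers from private case data.
        if kind == "test":
            transcript.append(("measurement", case.test_results.get(content, "No result available.")))
        else:
            transcript.append(("patient", case.history.get(content, "I'm not sure.")))
    return False, transcript  # no diagnosis within the turn budget


def scripted_doctor(transcript: List[Turn]) -> Action:
    """Stub doctor that follows a fixed script; a real doctor agent would be an LLM."""
    script = [("ask", "duration"), ("test", "blood_pressure"), ("diagnose", "Hypertension")]
    doctor_turns = sum(1 for role, _ in transcript if role == "doctor")
    return script[min(doctor_turns, len(script) - 1)]


case = Case(
    chief_complaint="I have been getting headaches.",
    history={"duration": "About two weeks."},
    test_results={"blood_pressure": "165/100 mmHg"},
    correct_diagnosis="hypertension",
)
correct, transcript = run_encounter(case, scripted_doctor)
```

The doctor only ever sees the transcript, never the `Case` object, which mirrors the point that information must be requested rather than read from a vignette.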

System Modules

Patient Agent

Simulates a patient with specific symptoms, history, and personality/biases

Model or implementation: GPT-4 (default for evaluation)

Doctor Agent

The AI model being evaluated; aims to diagnose the patient

Model or implementation: Various (Claude-3.5, GPT-4, Llama-3, etc.)

Measurement Agent

Simulates medical devices/labs; returns physical exam or imaging results

Model or implementation: Rule-based/LLM-wrapper

Moderator Agent

Orchestrates the simulation loop and evaluates correctness

Model or implementation: LLM-based controller

Modeling

Base Model: Evaluated multiple models: Claude-3.5-Sonnet, GPT-4, GPT-4o, Mixtral-8x7B, Llama-3-70B-Instruct, MedLlama3-8B, etc.

Compute: Not reported in the paper

Comparison to Prior Work

vs. MedQA: AgentClinic requires sequential information gathering and tool use, not just selecting an answer from a provided vignette.
vs. BiasMedQA: AgentClinic incorporates bias subtly through agent interaction and persona instructions rather than explicit prompt snippets.

Limitations

Patient agents are simulated by LLMs (GPT-4), which may not perfectly reflect real human patient variability or irrationality.
Accuracy metrics depend on exact string matching or LLM-judged equivalence of the diagnosis.
Evaluation is computationally expensive due to multi-turn, multi-agent interactions.
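
To make the second limitation concrete, here is a sketch of how a diagnosis check might combine normalized string matching with an optional judge. It is an assumption-laden illustration: `normalize`, `is_correct`, and the `judge` callback are hypothetical names, and the benchmark's real scoring logic may differ.

```python
import re
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(predicted: str,
               reference: str,
               judge: Optional[Callable[[str, str], bool]] = None) -> bool:
    """Accept an exact normalized match, else defer to an optional judge."""
    if normalize(predicted) == normalize(reference):
        return True
    if judge is not None:
        # In practice the judge would be an LLM call deciding equivalence;
        # here it is any callable supplied by the caller.
        return judge(predicted, reference)
    return False
```

String matching alone rejects valid synonyms (for example "heart attack" versus "myocardial infarction"), while an LLM judge accepts them at the cost of its own error rate, which is why the choice of matcher affects reported accuracy.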

Reproducibility

Code: https://agentclinic.github.io/

Code and data (structured JSON cases) are publicly available at the project page. Patient cases are derived from MedQA, MIMIC-IV, and NEJM. Human physician baselines are included for comparison.

📊 Experiments & Results

Evaluation Setup

Simulated clinical encounters across General Medicine (MedQA), Critical Care (MIMIC-IV), Specialists, and Multilingual settings.

Benchmarks:

AgentClinic-MedQA: general diagnosis, USMLE-derived [New]
AgentClinic-MIMIC-IV: critical care diagnosis, EHR-derived [New]
AgentClinic-NEJM: multimodal diagnosis, image + text [New]

Metrics:

Diagnostic Accuracy
Patient Confidence (1-10)
Patient Compliance (1-10)
Consultation Rating (1-10)
Statistical methodology: Confidence intervals reported for diagnostic accuracy.
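
The summary does not say how the confidence intervals are computed. As one common option for a proportion such as diagnostic accuracy, the sketch below uses the Wilson score interval; the choice of method, the function name `wilson_interval`, and the counts in the usage line are all assumptions for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 62 correct diagnoses out of 100 simulated cases.
low, high = wilson_interval(62, 100)
```

With benchmark sizes in the low hundreds of cases, such intervals are wide, so small accuracy differences between models should be read with care.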

Key Results

Main diagnostic accuracy results on AgentClinic-MedQA show Claude-3.5 and GPT-4 outperforming open-source models and reaching human-level performance.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
AgentClinic-MedQA	Diagnostic Accuracy	54	62.1	+8.1
AgentClinic-MedQA	Diagnostic Accuracy	19.0	62.1	+43.1
AgentClinic-MIMIC-IV	Diagnostic Accuracy	34.0	42.9	+8.9

Bias experiments reveal that doctor/patient biases reduce diagnostic accuracy and patient trust, with implicit biases having profound effects on patient perception.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
AgentClinic-MedQA	Normalized Accuracy (Cognitive Bias)	100	92.0	-8.0
AgentClinic-MedQA	Normalized Accuracy (Implicit Bias)	100	88.3	-11.7

Tool use experiments demonstrate that giving agents tools like Notebooks or Reflection cycles can significantly boost performance, especially for weaker models.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
AgentClinic-MedQA	Diagnostic Accuracy	19.0	41.1	+22.1
AgentClinic-MedQA	Diagnostic Accuracy	36.6	26.7	-9.9
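
The Δ column in the rows above is an absolute difference in percentage points. The sketch below recomputes it from the baseline and result columns and also derives the relative change, which the table does not report; the helper name `deltas` is hypothetical.

```python
from typing import List, Tuple

# (benchmark, metric, baseline %, result %) copied from the Key Results rows.
ROWS: List[Tuple[str, str, float, float]] = [
    ("AgentClinic-MedQA", "Diagnostic Accuracy", 54.0, 62.1),
    ("AgentClinic-MedQA", "Diagnostic Accuracy", 19.0, 62.1),
    ("AgentClinic-MIMIC-IV", "Diagnostic Accuracy", 34.0, 42.9),
    ("AgentClinic-MedQA", "Normalized Accuracy (Cognitive Bias)", 100.0, 92.0),
    ("AgentClinic-MedQA", "Normalized Accuracy (Implicit Bias)", 100.0, 88.3),
    ("AgentClinic-MedQA", "Diagnostic Accuracy", 19.0, 41.1),
    ("AgentClinic-MedQA", "Diagnostic Accuracy", 36.6, 26.7),
]


def deltas(baseline: float, result: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change in %)."""
    absolute = round(result - baseline, 1)
    relative = round(100 * (result - baseline) / baseline, 1)
    return absolute, relative


computed = [deltas(baseline, result) for _, _, baseline, result in ROWS]
for (bench, metric, baseline, result), (abs_pp, rel_pct) in zip(ROWS, computed):
    print(f"{bench}\t{metric}\t{baseline}\t{result}\t{abs_pp:+.1f} pp\t{rel_pct:+.1f}% rel.")
```

Absolute and relative changes can tell different stories for low baselines: a gain of a few points on a weak model is a large relative improvement.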

Experiment Figures

Bar charts comparing diagnostic accuracy of 11 LLMs and Human Physicians on AgentClinic-MedQA and AgentClinic-MIMIC-IV.

Impact of Cognitive and Implicit Biases on Accuracy and Patient Perception metrics (Confidence, Compliance, Consultation).

Main Takeaways

Static benchmarks like MedQA are poor predictors of interactive clinical performance; models with high USMLE scores (like Llama-3) can fail catastrophically in sequential environments.
Claude-3.5 Sonnet consistently outperforms other models (including GPT-4 and GPT-4o) across general, specialist, and multilingual settings.
Bias (both cognitive and implicit) quantifiably degrades diagnostic accuracy and, more severely, harms patient compliance and trust.
The utility of agent tools (RAG, Notebooks, Reflection) is model-dependent; stronger models leverage them effectively, while weaker models may get distracted and perform worse.
Multimodal capabilities are still maturing; even the best model (Claude-3.5 Sonnet) achieves only ~37% accuracy on image-based NEJM cases.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and prompting
Understanding of medical diagnostic processes (anamnesis, testing, diagnosis)
Basic knowledge of agent-based systems (roles, turn-taking)

Key Terms

OSCE: Objective Structured Clinical Examination, a standard health sciences exam format that tests clinical skills; AgentClinic uses it as the template for agent interactions

MedQA: A dataset of medical questions taken from professional medical board exams

MIMIC-IV: A large database of deidentified health-related data associated with patients admitted to critical care units

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps

RAG: Retrieval-Augmented Generation—providing the model with external knowledge (e.g., from textbooks or web) to aid decision-making

Implicit Bias: Unconscious associations (e.g., race, gender) affecting patient interactions and decisions

Cognitive Bias: Systematic deviations from rational judgment (e.g., anchoring, recency bias) affecting diagnosis