Evaluation Setup
Simulated clinical encounters spanning general medicine (MedQA), critical care (MIMIC-IV), specialist, and multilingual settings.
Benchmarks:
- AgentClinic-MedQA: general diagnosis, USMLE-derived [New]
- AgentClinic-MIMIC-IV: critical care diagnosis, EHR-derived [New]
- AgentClinic-NEJM: multimodal diagnosis, image + text [New]
Metrics:
- Diagnostic Accuracy
- Patient Confidence (1-10)
- Patient Compliance (1-10)
- Consultation Rating (1-10)
- Statistical methodology: Confidence intervals reported for diagnostic accuracy.
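The summary says confidence intervals are reported for diagnostic accuracy but does not name the method. As a minimal sketch, assuming a Wilson score interval for a binomial proportion (an assumption, not necessarily what the paper used), an interval for an accuracy estimate could be computed as follows. The function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion such as diagnostic accuracy.

    Assumption: each case is an independent correct/incorrect trial.
    z = 1.96 corresponds to an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")

    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not taken from the paper.
    low, high = wilson_interval(correct=50, total=100)
    print(f"accuracy = 50.0%, 95% CI = [{100 * low:.1f}%, {100 * high:.1f}%]")
```

The Wilson interval is chosen here because it behaves better than the plain normal approximation when the case count is small or the accuracy is near 0 or 1, which is a plausible situation for benchmark subsets.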
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Main diagnostic accuracy results on AgentClinic-MedQA show Claude-3.5 and GPT-4 outperforming open-source models and reaching human-level performance. | | | | |
| AgentClinic-MedQA | Diagnostic Accuracy | 54 | 62.1 | +8.1 |
| AgentClinic-MedQA | Diagnostic Accuracy | 19.0 | 62.1 | +43.1 |
| AgentClinic-MIMIC-IV | Diagnostic Accuracy | 34.0 | 42.9 | +8.9 |
| Bias experiments reveal that doctor/patient biases reduce diagnostic accuracy and patient trust, with implicit biases having profound effects on patient perception. | | | | |
| AgentClinic-MedQA | Normalized Accuracy (Cognitive Bias) | 100 | 92.0 | -8.0 |
| AgentClinic-MedQA | Normalized Accuracy (Implicit Bias) | 100 | 88.3 | -11.7 |
| Tool use experiments demonstrate that giving agents tools like Notebooks or Reflection cycles can significantly boost performance, especially for weaker models. | | | | |
| AgentClinic-MedQA | Diagnostic Accuracy | 19.0 | 41.1 | +22.1 |
| AgentClinic-MedQA | Diagnostic Accuracy | 36.6 | 26.7 | -9.9 |
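The Δ column in the table is the reported value minus the baseline. The sketch below reproduces that arithmetic from the table's own numbers. It also shows one plausible reading of "Normalized Accuracy", namely accuracy under bias as a percentage of unbiased accuracy. That definition is an assumption inferred from the baseline of 100, and the helper names `delta` and `normalized_accuracy` are hypothetical.

```python
def delta(baseline: float, result: float) -> float:
    """Difference between reported value and baseline, rounded to one decimal."""
    return round(result - baseline, 1)


def normalized_accuracy(biased_acc: float, unbiased_acc: float) -> float:
    """Accuracy under bias as a percentage of unbiased accuracy.

    Assumption: this is how a baseline of 100 arises in the bias rows;
    the summary does not state the formula explicitly.
    """
    if unbiased_acc <= 0:
        raise ValueError("unbiased accuracy must be positive")
    return 100.0 * biased_acc / unbiased_acc


if __name__ == "__main__":
    # (baseline, this paper) pairs copied from the diagnostic accuracy rows.
    rows = [(54, 62.1), (19.0, 62.1), (34.0, 42.9), (19.0, 41.1), (36.6, 26.7)]
    for baseline, result in rows:
        print(f"{baseline:>5} -> {result:>5}: {delta(baseline, result):+.1f}")

    # Hypothetical raw accuracies, for illustration only (not from the paper).
    print(normalized_accuracy(biased_acc=45.0, unbiased_acc=50.0))
```

Under this reading, a normalized accuracy of 92.0 would mean the biased agent retains 92% of its unbiased accuracy, so the Δ of -8.0 is a relative drop, not a drop of 8 percentage points in raw accuracy.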
Main Takeaways
- Static benchmarks like MedQA are poor predictors of interactive clinical performance; models with high USMLE scores (like Llama-3) can fail catastrophically in sequential environments.
- Claude-3.5 Sonnet consistently outperforms other models (including GPT-4 and GPT-4o) across general, specialist, and multilingual settings.
- Bias (both cognitive and implicit) quantifiably degrades diagnostic accuracy and, more severely, harms patient compliance and trust.
- The utility of agent tools (RAG, Notebooks, Reflection) is model-dependent; stronger models leverage them effectively, while weaker models may get distracted and perform worse.
- Multimodal capabilities are still maturing; even the best model (Claude-3.5) achieves only ~37% accuracy on image-based NEJM cases.