Evaluation Setup
Real-world ambulatory primary care clinic (urgent care visits)
Benchmarks:
- Clinical Feasibility Cohort (Real-world patient history taking and diagnosis) [New]
Metrics:
- Safety stops (count)
- Diagnostic accuracy (Bond/Graber scale, Top-k recall)
- Management plan quality (Likert scale)
- Patient attitudes (GAAIS)
- Statistical methodology: Two-sided Wilcoxon signed-rank tests with Bonferroni correction for blinded ratings; Friedman omnibus tests for survey scales
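To make the Top-k recall metric concrete, the following is a minimal sketch, assuming each encounter yields a ranked differential diagnosis (DDx) list and one final diagnosis from chart review. The function name `top_k_recall`, the case-insensitive string matching, and the toy cases are illustrative assumptions, not the study's scoring procedure or data.

```python
from typing import Sequence


def top_k_recall(
    ranked_ddx: Sequence[Sequence[str]],
    final_dx: Sequence[str],
    k: int,
) -> float:
    """Fraction of encounters whose final diagnosis is among the first k DDx entries."""
    if len(ranked_ddx) != len(final_dx):
        raise ValueError("need exactly one final diagnosis per encounter")
    if not final_dx:
        raise ValueError("need at least one encounter")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for ddx, truth in zip(ranked_ddx, final_dx):
        # Assumption: exact match after normalising case and whitespace.
        top_k = {d.strip().lower() for d in ddx[:k]}
        if truth.strip().lower() in top_k:
            hits += 1
    return hits / len(final_dx)


# Hypothetical toy encounters, not taken from the study.
cases = [
    ["viral pharyngitis", "strep pharyngitis", "mononucleosis"],
    ["ankle sprain", "ankle fracture"],
    ["gastroenteritis", "appendicitis", "urinary tract infection", "constipation"],
    ["tension headache", "migraine", "sinusitis"],
]
truths = ["strep pharyngitis", "ankle sprain", "constipation", "cluster headache"]

top1 = top_k_recall(cases, truths, k=1)
top3 = top_k_recall(cases, truths, k=3)
```

In this sketch, an "inclusion of final diagnosis" figure corresponds to setting `k` to the full length of the DDx list, while Top-3 accuracy fixes `k=3`.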
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Safety and feasibility results demonstrate the system is viable for real-world deployment with supervision. | | | | |
| Clinical Feasibility Cohort | Safety Stops | 0 | 0 | 0 |
| Diagnostic accuracy results show high concordance with ground truth derived from chart review. | | | | |
| Clinical Feasibility Cohort | Inclusion of Final Diagnosis | Not reported in the paper | 90% | Not reported in the paper |
| Clinical Feasibility Cohort | Top-3 Accuracy | Not reported in the paper | 75% | Not reported in the paper |
| Comparative ratings between AMIE and PCPs (blinded evaluators) reveal trade-offs in management planning. | | | | |
| Clinical Feasibility Cohort | DDx Quality (p-value) | 0.05 | 0.6 | Not applicable |
| Clinical Feasibility Cohort | Mx Practicality (p-value) | 0.05 | 0.003 | Not applicable |
| Clinical Feasibility Cohort | Mx Cost Effectiveness (p-value) | 0.05 | 0.004 | Not applicable |
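As a worked illustration of how a Bonferroni correction interacts with the p-values in the table, the sketch below divides the significance level by the number of comparisons. Two points are assumptions made only for illustration: that the family consists of exactly the three comparisons listed here, and that the reported p-values are unadjusted. The paper may have corrected over a different number of rating axes. The helper name `bonferroni_significant` is hypothetical.

```python
from typing import Dict


def bonferroni_significant(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Flag each comparison as significant at the Bonferroni-adjusted threshold alpha / m."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    threshold = alpha / m  # with m = 3 this is roughly 0.0167
    return {name: p < threshold for name, p in p_values.items()}


# p-values as listed in the results table; the family size of 3 is an assumption.
reported = {
    "DDx quality": 0.6,
    "Mx practicality": 0.003,
    "Mx cost effectiveness": 0.004,
}

significant = bonferroni_significant(reported)
for name, flag in significant.items():
    print(f"{name}: {'significant' if flag else 'not significant'}")
```

Under these assumptions the two management-plan comparisons stay below the adjusted threshold and the DDx quality comparison does not, which matches the direction of the takeaways.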
Main Takeaways
- AMIE demonstrated safe operation in a real-world setting with zero required safety interventions across 100 diverse patient encounters.
- Diagnostic reasoning is robust: AMIE's differential included the final diagnosis in 90% of cases, and blinded raters found no significant difference in DDx quality between AMIE and human PCPs (p = 0.6).
- Although safe and accurate, AMIE's management plans were rated below those of PCPs on practicality (p = 0.003) and cost-effectiveness (p = 0.004), suggesting a tendency toward over-testing or toward theoretical rather than pragmatic care.
- Patient acceptance is high: Interactions with AMIE significantly improved patient attitudes toward AI in healthcare, as measured by the GAAIS.