Evaluation Setup
Clinical tasks are simulated with structured patient cases from PMC, restricted to cases dated after July 2024 to avoid data contamination.
Benchmarks:
- MedR-Bench-Diagnosis: diagnostic decision-making and examination recommendation (957 cases) [New]
- MedR-Bench-Treatment: treatment planning (496 cases) [New]
Metrics:
- Accuracy (Final Diagnosis/Treatment)
- Precision & Recall (Examination Recommendation)
- Efficiency (Reasoning)
- Factuality (Reasoning)
- Completeness (Reasoning)
- Statistical methodology: Not explicitly reported in the paper
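The Precision & Recall metric for examination recommendation can be illustrated with a minimal sketch. It assumes exact, case-insensitive string matching between recommended and reference examinations. The matching procedure actually used in the paper is not described in this summary and may differ. The function name `precision_recall` and the example case are hypothetical.

```python
def precision_recall(recommended, reference):
    """Set-based precision and recall, returned as percentages.

    Assumption: two examinations match only if their normalized
    names are identical (illustrative, not the paper's matcher).
    """
    rec = {name.strip().lower() for name in recommended}
    ref = {name.strip().lower() for name in reference}
    hits = rec & ref  # examinations that were both recommended and needed
    precision = 100.0 * len(hits) / len(rec) if rec else 0.0
    recall = 100.0 * len(hits) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical case: 4 recommended examinations, 3 reference examinations.
recommended = ["Chest X-ray", "CBC", "ECG", "Troponin"]
reference = ["ECG", "Troponin", "Echocardiogram"]

p, r = precision_recall(recommended, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=50.00 recall=66.67
```

Under this definition, a model that recommends many tests can raise recall at the cost of precision, which is the tradeoff the examination recommendation results refer to.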
Key Results
Diagnostic accuracy results show DeepSeek-R1 leading across all settings, with significant improvements when full information (oracle) is provided.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MedR-Bench-Diagnosis (Oracle) | Accuracy | 84.53 | 89.76 | +5.23 |
| MedR-Bench-Diagnosis (1-turn) | Accuracy | 64.99 | 71.79 | +6.80 |

Examination Recommendation results highlight a tradeoff between precision and recall, with models generally struggling to identify relevant tests accurately.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Examination Recommendation (1-turn) | Recall | 43.12 | 43.61 | +0.49 |
| Examination Recommendation (1-turn) | Precision | 32.48 | 41.78 | +9.30 |

Reasoning quality metrics show that while models are factual, they differ significantly in efficiency.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MedR-Bench-Diagnosis (Oracle) | Efficiency | 71.20 | 97.17 | +25.97 |
| MedR-Bench-Diagnosis (Oracle) | Factuality | 84.02 | 98.23 | +14.21 |
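As a sanity check, the Δ column can be recomputed from the other two value columns. The sketch below hard-codes the values reported in the Key Results and assumes Δ is simply the "This Paper" value minus the "Baseline" value. The variable names are illustrative.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from Key Results
rows = [
    ("MedR-Bench-Diagnosis (Oracle)", "Accuracy", 84.53, 89.76, 5.23),
    ("MedR-Bench-Diagnosis (1-turn)", "Accuracy", 64.99, 71.79, 6.80),
    ("Examination Recommendation (1-turn)", "Recall", 43.12, 43.61, 0.49),
    ("Examination Recommendation (1-turn)", "Precision", 32.48, 41.78, 9.30),
    ("MedR-Bench-Diagnosis (Oracle)", "Efficiency", 71.20, 97.17, 25.97),
    ("MedR-Bench-Diagnosis (Oracle)", "Factuality", 84.02, 98.23, 14.21),
]

# Flag any row whose reported delta differs from (this_paper - baseline),
# using a small tolerance to absorb floating-point error.
mismatches = [
    (benchmark, metric)
    for benchmark, metric, baseline, this_paper, reported in rows
    if abs((this_paper - baseline) - reported) > 1e-6
]

print(mismatches)  # []
```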
Main Takeaways
- Open-source models like DeepSeek-R1 are competitive with or superior to proprietary models (OpenAI-o3-mini) in clinical diagnostic accuracy (89.76% vs 84.53%).
- Models perform well on diagnosis when information is complete (oracle accuracy of 84.53% to 89.76% in the results above) but struggle significantly with information gathering (examination recommendation), showing low recall (<44%).
- Treatment planning remains a difficult task, with precision scores for treatment plans (~30%) being much lower than diagnostic accuracy.
- Diagnostic performance on rare diseases is consistent with that on common diseases, suggesting robust knowledge, but treatment planning precision drops for rare conditions across most models.