| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Main evaluation on MIMIC-CDM: accuracy is consistently higher than both the baselines and human experts. | ||||
| MIMIC-CDM (Evaluation Cohort) | Diagnostic Accuracy | 79.2 | 90.4 | +11.2 |
| MIMIC-CDM (Reader Study Subset) | Diagnostic Accuracy | 88.8 | 90.4 | +1.6 |
| External validation shows the framework's robustness across languages and new disease categories. | ||||
| Chinese PLA General Hospital | Diagnostic Accuracy (English translation) | Not reported in the paper | Not reported in the paper | +10.2 |
| Chinese PLA General Hospital | Diagnostic Accuracy (raw Chinese text) | Not reported in the paper | Not reported in the paper | +11.9 |
| Chinese PLA General Hospital (Uncovered Categories) | Diagnostic Accuracy | Not reported in the paper | Not reported in the paper | +17.1 |
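
The Δ column can be checked against the rows that report both accuracies. The sketch below assumes Δ is simply the accuracy in the "This Paper" column minus the baseline accuracy, in percentage points, rounded to one decimal place; the names `rows` and `delta` are illustrative and not from the paper. The external-validation rows are left out because their underlying accuracies are not reported.

```python
# Accuracy values copied from the two MIMIC-CDM rows of the table
# (assumed to be percentages): (baseline, this paper).
rows = {
    "MIMIC-CDM (Evaluation Cohort)": (79.2, 90.4),
    "MIMIC-CDM (Reader Study Subset)": (88.8, 90.4),
}


def delta(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(ours - baseline, 1)


# Recompute the delta column for every row that has both values.
deltas = {name: delta(b, o) for name, (b, o) in rows.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.1f}")
```

Under that assumption, the recomputed values agree with the +11.2 and +1.6 entries in the table.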