Evaluation Setup
Stepwise clinical diagnosis of acute abdominal presentations
Benchmarks:
- MIMIC-CDM (Sequential clinical diagnosis)
- Chinese PLA General Hospital Cohort (External real-world validation) [New]
Metrics:
- Diagnostic Accuracy
- Trajectory Consistency (alignment with human workup)
- Guideline Compliance Score
- Statistical methodology: Not explicitly reported in the paper
Key Results
Main evaluation on MIMIC-CDM demonstrates consistent accuracy gains and superhuman performance compared to baselines and human experts.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MIMIC-CDM (Evaluation Cohort) | Diagnostic Accuracy | 79.2 | 90.4 | +11.2 |
| MIMIC-CDM (Reader Study Subset) | Diagnostic Accuracy | 88.8 | 90.4 | +1.6 |

External validation shows the framework's robustness across languages and new disease categories.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Chinese PLA General Hospital | Diagnostic Accuracy (English Trans.) | Not reported in the paper | Not reported in the paper | +10.2 |
| Chinese PLA General Hospital | Diagnostic Accuracy (Chinese Raw) | Not reported in the paper | Not reported in the paper | +11.9 |
| Chinese PLA General Hospital (Uncovered Categories) | Diagnostic Accuracy | Not reported in the paper | Not reported in the paper | +17.1 |
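
As a quick sanity check on the Δ column, the sketch below recomputes the gain for the rows where both scores are given. It assumes Δ is the simple difference between the "This Paper" and "Baseline" accuracies, in percentage points; that reading matches the two MIMIC-CDM rows but is not stated explicitly in the summary. The function name `delta_points` is hypothetical, and rows with unreported scores are left as `None` rather than back-filled.

```python
from typing import Optional


def delta_points(baseline: Optional[float], ours: Optional[float]) -> Optional[float]:
    """Absolute gain in percentage points, or None if either score is unreported."""
    if baseline is None or ours is None:
        return None
    # Round to one decimal to match the table's precision and hide float noise.
    return round(ours - baseline, 1)


# (benchmark, baseline accuracy, this-paper accuracy); None = not reported.
rows = [
    ("MIMIC-CDM (Evaluation Cohort)", 79.2, 90.4),
    ("MIMIC-CDM (Reader Study Subset)", 88.8, 90.4),
    ("Chinese PLA General Hospital (English Trans.)", None, None),
]

deltas = {name: delta_points(base, ours) for name, base, ours in rows}

for name, delta in deltas.items():
    shown = "n/a (scores not reported)" if delta is None else f"{delta:+.1f}"
    print(f"{name}: {shown}")
```

For the external cohort only the Δ values are available, so the underlying accuracies cannot be recovered from this table alone.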
Main Takeaways
- Diagnostic Cognition Primitives (DCPs) improve the agent's accuracy by 11.2 percentage points (79.2 to 90.4 on the MIMIC-CDM evaluation cohort) without parameter updates, converting experience into a governable asset
- The system achieves 'error-driven dividends', where experiences derived from past failures provide greater performance gains than those from successes
- Evolution is longitudinal: experience from later-stage encounters (1700-2000) yields higher utility and clinician ratings than early-stage experience
- The framework generalizes across languages (Chinese/English) and institutions, indicating that DCPs capture robust clinical heuristics rather than dataset-specific artifacts