Evaluation Setup
Validation on a dataset of 125,000 outpatient visits, focusing on Cardiology, Endocrinology, and Gastroenterology.
Benchmarks:
- Internal Clinical Dataset (Medical Test Recommendation) [New]
Metrics:
- Coverage Rate (CR)
- Accuracy
- Miss Rate (MR)
- Clinical Relevance Score (CRS)
- Statistical methodology: Not explicitly reported in the paper
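The summary names these metrics but gives no formulas. The sketch below shows one plausible way to compute the three rate metrics from per-visit test sets. Every definition in it is an assumption, not the paper's: coverage is read as the pooled share of reference tests that were recommended, accuracy as the share of recommended tests found in the reference set, and miss rate as the share of visits with at least one critical test left out. The `Visit` type and its fields are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Visit:
    """Hypothetical record for one outpatient visit (not from the paper)."""
    recommended: frozenset[str]  # tests the system proposed
    reference: frozenset[str]    # tests treated as ground truth (assumed)
    critical: frozenset[str]     # reference tests labeled critical (assumed)


def coverage_rate(visits: list[Visit]) -> float:
    """Assumed definition: pooled share of reference tests that were recommended."""
    hits = sum(len(v.recommended & v.reference) for v in visits)
    total = sum(len(v.reference) for v in visits)
    return hits / total if total else 0.0


def accuracy(visits: list[Visit]) -> float:
    """Assumed definition: pooled share of recommended tests found in the reference."""
    hits = sum(len(v.recommended & v.reference) for v in visits)
    total = sum(len(v.recommended) for v in visits)
    return hits / total if total else 0.0


def miss_rate(visits: list[Visit]) -> float:
    """Assumed definition: share of visits missing at least one critical test."""
    if not visits:
        return 0.0
    missed = sum(1 for v in visits if v.critical - v.recommended)
    return missed / len(visits)


# Toy data, invented purely to exercise the functions.
visits = [
    Visit(frozenset({"ECG", "troponin", "lipid panel"}),
          frozenset({"ECG", "troponin"}),
          frozenset({"troponin"})),
    Visit(frozenset({"HbA1c"}),
          frozenset({"HbA1c", "TSH"}),
          frozenset({"TSH"})),
]

print(coverage_rate(visits), accuracy(visits), miss_rate(visits))
```

Under these assumed definitions the miss rate is not simply one minus the coverage rate, which is consistent with the reported figures (92.3% coverage alongside a 2.1% miss rate) but does not confirm that the paper uses them.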
Key Results
HiRMed consistently outperforms baseline methods (Traditional Vector Similarity and Flat-RAG) across all metrics on the overall dataset.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Internal Clinical Dataset | Coverage Rate (CR) | 84.7% | 92.3% | +7.6% |
| Internal Clinical Dataset | Accuracy | 82.4% | 88.7% | +6.3% |
| Internal Clinical Dataset | Miss Rate (MR) | 5.8% | 2.1% | -3.7% |
| Internal Clinical Dataset | Clinical Relevance Score (CRS) | 3.7 | 4.3 | +0.6 |

Department-specific analysis shows HiRMed is particularly effective in Cardiology.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Cardiology Department | Coverage Rate | Not reported in the paper | 94.2% | Not reported in the paper |
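The Δ values in the results table are absolute differences (percentage points for the rate metrics, score points for CRS). As a small worked check, the sketch below recomputes them from the reported baseline and HiRMed figures and also derives the relative change. The relative figures are not reported in the summary and are shown only for illustration; the function names are hypothetical.

```python
# Values copied from the results table: (baseline, HiRMed).
RESULTS = {
    "Coverage Rate (CR)": (84.7, 92.3),
    "Accuracy": (82.4, 88.7),
    "Miss Rate (MR)": (5.8, 2.1),
    "Clinical Relevance Score (CRS)": (3.7, 4.3),
}


def absolute_delta(baseline: float, new: float) -> float:
    """Absolute difference, rounded to one decimal like the table."""
    return round(new - baseline, 1)


def relative_change_pct(baseline: float, new: float) -> float:
    """Change relative to the baseline, in percent (not reported in the summary)."""
    return round(100.0 * (new - baseline) / baseline, 1)


for metric, (baseline, new) in RESULTS.items():
    print(
        f"{metric}: {absolute_delta(baseline, new):+.1f} absolute, "
        f"{relative_change_pct(baseline, new):+.1f}% relative"
    )
```

Read this way, the 3.7-point drop in miss rate corresponds to a reduction of roughly 64% relative to the baseline, which is why it stands out more than the coverage and accuracy gains.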
Main Takeaways
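- HiRMed outperforms single-step RAG (Flat-RAG) and vector similarity methods on every reported metric, most notably cutting the critical miss rate from 5.8% (Flat-RAG) to 2.1%. No statistical methodology is explicitly reported, so these gains are best read as descriptive.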
- The hierarchical structure allows for consistent performance across different specialties, with Cardiology showing the strongest results (94.2% coverage).
- Expert review confirms high clinical relevance (4.3/5.0), suggesting the system's reasoning aligns well with human medical decision-making.