A Tree-based RAG-Agent Recommendation System: A Case Study in Medical Test Data

📝 Paper Summary

Medical Recommendation Systems | Modularized RAG Pipeline | Hierarchical Reasoning

HiRMed improves medical test recommendations by using a tree-structured RAG system that progressively refines diagnostic decisions from general symptoms to specific departments and finally to individual test items.

Core Problem

Traditional medical test recommendation systems (rule-based or flat similarity matching) fail to capture the hierarchical reasoning process of doctors and often miss complex symptom-disease relationships.

Why it matters:

Direct vector matching lacks the nuanced, multi-stage reasoning required for accurate medical diagnosis, often overlooking the context of resource constraints and diagnostic uncertainty
Existing LLM/RAG approaches lack a structural hierarchy, leading to insufficient integration of specialized domain knowledge at different diagnostic stages
Poor recommendations increase miss rates for critical tests and reduce diagnostic accuracy in outpatient settings

Concrete Example: A standard RAG system might directly map 'chest pain' to a generic set of tests without first determining if the context suggests a cardiac issue versus a gastrointestinal one, potentially missing specialized cardiac markers.

Key Novelty

Hierarchical RAG-enhanced Medical Test Recommendation (HiRMed)

Implements a tree-structured architecture (Root → Department → Item) where each node performs a specialized RAG process to narrow down the diagnostic path progressively
Uses a dual-layer knowledge base (general department-level vs. specific item-level) to provide context-appropriate medical knowledge at each reasoning stage
Incorporates a memory mechanism to pass reasoning history between layers, ensuring consistency as the system moves from broad symptom assessment to specific test selection

Architecture

The three-layer hierarchical architecture of HiRMed (Root Layer → Department Layer → Item Layer), showing how patient queries are processed through progressive reasoning steps.

Evaluation Highlights

Achieves 92.3% coverage rate of relevant diagnostic tests, outperforming traditional vector similarity (72.8%) and flat RAG (84.7%)
Reduces miss rate for critical tests to 2.1%, compared to 5.8% for flat RAG and 10.6% for vector similarity
Attains a clinical relevance score of 4.3/5.0 in expert physician review, verifying the medical appropriateness of recommendations

Breakthrough Assessment

7/10

Strong practical application of hierarchical RAG to the medical domain, with sizable reported gains over flat baselines. While the components (RAG, trees) are known, the specific integration for medical reasoning is well-executed and validated on proprietary outpatient data and by expert physician review.

⚙️ Technical Details

Problem Definition

Setting: Outpatient medical test recommendation using structured patient data and medical knowledge bases

Inputs: Patient query containing demographics, physical parameters, and presenting symptoms

Outputs: Ranked list of recommended diagnostic tests with interpretive scores
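
The inputs and outputs above could be modeled as simple typed records. The following is a minimal sketch only: the class names, field names, and the "higher score ranks first" convention are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass
class PatientQuery:
    """Hypothetical input record: demographics, physical parameters, symptoms."""
    age: int
    sex: str
    physical_params: dict = field(default_factory=dict)
    symptoms: list = field(default_factory=list)


@dataclass
class RecommendedTest:
    """Hypothetical output item: one diagnostic test and its score."""
    test_name: str
    score: float  # assumed convention: higher means more strongly recommended


def rank(recs):
    """Return recommendations sorted from highest to lowest score."""
    return sorted(recs, key=lambda r: r.score, reverse=True)


# Made-up example values for illustration only.
query = PatientQuery(
    age=58,
    sex="M",
    physical_params={"systolic_bp": 150.0},
    symptoms=["chest pain", "shortness of breath"],
)
ranked = rank([
    RecommendedTest("lipid panel", 0.4),
    RecommendedTest("ECG", 0.9),
    RecommendedTest("troponin", 0.8),
])
print([r.test_name for r in ranked])
```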

Pipeline Flow

Root Layer (Symptom Analysis & Dept Routing)
Department Layer (Specialty-Specific Reasoning)
Item Layer (Specific Test Selection & Weighting)
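
The three-stage flow above can be sketched as a chain of functions that share a reasoning history. This is a toy illustration under stated assumptions: the keyword tables stand in for the LLM and retrieval calls at each node, and every name here (Memory, root_layer, DEPT_KEYWORDS, and so on) is hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Reasoning history carried from one layer to the next."""
    steps: list = field(default_factory=list)


# Toy lookup tables standing in for LLM + retrieval at each node (assumptions).
DEPT_KEYWORDS = {
    "cardiology": {"chest pain", "palpitations"},
    "gastroenterology": {"heartburn", "abdominal pain"},
}
DEPT_TESTS = {
    "cardiology": ["ECG", "troponin"],
    "gastroenterology": ["endoscopy", "H. pylori test"],
}


def root_layer(symptoms, memory):
    """Route symptoms to candidate departments."""
    depts = [d for d, kws in DEPT_KEYWORDS.items() if kws & set(symptoms)]
    memory.steps.append(f"root: routed to {depts}")
    return depts


def department_layer(dept, memory):
    """Narrow down to candidate tests for one department."""
    candidates = DEPT_TESTS.get(dept, [])
    memory.steps.append(f"{dept}: candidates {candidates}")
    return candidates


def item_layer(candidates, memory):
    """Select final tests; this stub only de-duplicates while keeping order."""
    selected = list(dict.fromkeys(candidates))
    memory.steps.append(f"item: selected {selected}")
    return selected


def recommend(symptoms):
    """Run Root -> Department -> Item and return the tests plus the history."""
    memory = Memory()
    candidates = []
    for dept in root_layer(symptoms, memory):
        candidates.extend(department_layer(dept, memory))
    return item_layer(candidates, memory), memory


tests, history = recommend(["chest pain"])
print(tests)
print(history.steps)
```

In a real system each stub would be replaced by a retrieval step plus a model call, but the control flow and the shared history object would keep the same shape.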

System Modules

Root Layer Reasoning (Reasoning & Routing)

Analyze initial symptoms to identify potential medical specialties (departments)

Model or implementation: GPT-O1

Root Layer Weighting (Weighting)

Prioritize recommended departments

Model or implementation: Fine-tuned LLaMA3.2-3B

Department Layer Reasoning (Reasoning & Routing)

Refine hypotheses using specialty guidelines to narrow down potential tests

Model or implementation: GPT-O1

Item Layer Reasoning

Select specific diagnostic tests and resolve inconsistencies using reasoning history

Model or implementation: GPT-O1 (with Memory component)

Item Layer Weighting (Weighting)

Assign final weights/urgency scores to recommended tests

Model or implementation: Fine-tuned LLaMA3.2-3B

Novel Architectural Elements

Three-layer hierarchical RAG structure (Root → Department → Item) mirroring medical diagnostic logic
Dual-layer knowledge base integration (General Department vs. Granular Item)
Memory-augmented reasoning mechanism passing context between hierarchical nodes
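
One way to picture the dual-layer knowledge base is as two separate stores, with the retrieval call choosing the store that matches the current layer. The sketch below is an assumption-laden toy: the entries are made-up placeholders, and token overlap is a crude stand-in for the embedding-based retrieval a real system would use.

```python
from typing import Optional

# Coarse department-level notes and fine item-level notes (placeholder content).
DEPARTMENT_KB = [
    {"dept": "cardiology",
     "text": "chest pain and palpitations suggest cardiac evaluation"},
    {"dept": "gastroenterology",
     "text": "heartburn and abdominal pain suggest digestive evaluation"},
]
ITEM_KB = [
    {"dept": "cardiology", "test": "troponin",
     "text": "troponin detects myocardial injury in acute chest pain"},
    {"dept": "cardiology", "test": "ECG",
     "text": "ECG records rhythm for palpitations"},
    {"dept": "gastroenterology", "test": "endoscopy",
     "text": "endoscopy inspects the upper digestive tract for persistent heartburn"},
]


def overlap(query: str, text: str) -> int:
    """Count shared lowercase tokens between the query and an entry."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, layer: str, dept: Optional[str] = None, k: int = 1):
    """Return the top-k entries from the knowledge base that matches the layer."""
    kb = DEPARTMENT_KB if layer == "department" else ITEM_KB
    if dept is not None:
        kb = [e for e in kb if e["dept"] == dept]
    return sorted(kb, key=lambda e: overlap(query, e["text"]), reverse=True)[:k]


query = "acute chest pain"
top_dept = retrieve(query, layer="department")[0]["dept"]
top_item = retrieve(query, layer="item", dept=top_dept)[0]["test"]
print(top_dept, top_item)
```

Scoping the item-level lookup to the department chosen one layer up is what keeps granular test knowledge out of the early routing step.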

Modeling

Base Model: GPT-O1 (Reasoning) and LLaMA3.2-3B (Weighting)

Training Method: Fine-tuning (Supervised)

Adaptation: Supervised fine-tuning of LLaMA3.2-3B

Training Data:

125,000 outpatient visit records from multiple departments
Records include initial consultation, recommended tests, follow-up tests, and outcomes
Medical terminology was standardized and PII was removed

Compute: Not reported in the paper

Comparison to Prior Work

vs. Flat-RAG: HiRMed uses hierarchical reasoning steps (dept → item) rather than a single retrieval step
vs. TVS: HiRMed incorporates LLM-based reasoning and context awareness rather than just semantic matching
vs. Zhang et al. [DRLK]: Both use hierarchy, but HiRMed integrates RAG at each node explicitly for medical test recommendation rather than general QA
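
For contrast, the TVS baseline amounts to ranking tests by cosine similarity to the symptom embedding, with no intermediate reasoning. The sketch below assumes hand-written 3-dimensional vectors purely for illustration; a real system would obtain embeddings from a trained encoder, and the function names are hypothetical.

```python
import math

# Hypothetical toy embeddings for three tests (not from the paper).
TEST_EMBEDDINGS = {
    "ECG": [0.9, 0.1, 0.0],
    "endoscopy": [0.1, 0.9, 0.0],
    "HbA1c": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tvs_recommend(symptom_vec, k=2):
    """Rank tests by similarity to the symptom vector and keep the top k."""
    scored = [(name, cosine(symptom_vec, vec))
              for name, vec in TEST_EMBEDDINGS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


chest_pain_vec = [0.8, 0.3, 0.0]  # assumed embedding of "chest pain"
for name, score in tvs_recommend(chest_pain_vec):
    print(f"{name}: {score:.3f}")
```

Because the ranking depends only on vector geometry, nothing in this baseline can ask whether the context points to a cardiac or a gastrointestinal cause, which is the gap the hierarchical design targets.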

Limitations

Evaluation limited to three departments (cardiology, endocrinology, gastroenterology)
Performance depends heavily on the quality and completeness of the underlying knowledge base
Computation costs of multi-stage RAG (calling GPT-O1 multiple times per patient) are not analyzed
No statistical significance tests reported for the performance improvements

Reproducibility

No replication artifacts mentioned in the paper. Code URL, specific prompts, and fine-tuned weights are not provided. Dataset is proprietary hospital data.

📊 Experiments & Results

Evaluation Setup

Validation on a dataset of 125,000 outpatient visits, focusing on Cardiology, Endocrinology, and Gastroenterology.

Benchmarks:

Internal Clinical Dataset (Medical Test Recommendation) [New]

Metrics:

Coverage Rate (CR)
Accuracy
Miss Rate (MR)
Clinical Relevance Score (CRS)
Statistical methodology: Not explicitly reported in the paper

Key Results

HiRMed consistently outperforms baseline methods (Traditional Vector Similarity and Flat-RAG) across all metrics on the overall dataset. Δ is the absolute difference from the baseline, in percentage points (pp) for rates.
Benchmark	Metric	Baseline (Flat-RAG)	This Paper	Δ
Internal Clinical Dataset	Coverage Rate (CR)	84.7%	92.3%	+7.6 pp
Internal Clinical Dataset	Accuracy	82.4%	88.7%	+6.3 pp
Internal Clinical Dataset	Miss Rate (MR)	5.8%	2.1%	-3.7 pp
Internal Clinical Dataset	Clinical Relevance Score (CRS)	3.7	4.3	+0.6
Department-specific analysis shows HiRMed is particularly effective in Cardiology.
Benchmark	Metric	Baseline	This Paper	Δ
Cardiology Department	Coverage Rate	Not reported in the paper	94.2%	Not reported in the paper

Main Takeaways

HiRMed outperforms single-step RAG and vector similarity methods by a clear margin, most notably on critical miss rate (2.1% vs. 5.8% for Flat-RAG and 10.6% for vector similarity), although the paper reports no statistical significance tests.
The hierarchical structure allows for consistent performance across different specialties, with Cardiology showing the strongest results (94.2% coverage).
Expert review confirms high clinical relevance (4.3/5.0), suggesting the system's reasoning aligns well with human medical decision-making.
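
To put the miss-rate gap in relative terms, the reported figures can be converted into fractional reductions. This is simple arithmetic on the numbers quoted above, not an additional result from the paper; the helper name is hypothetical.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Miss rates (%) as reported in the summary above.
flat_rag, tvs, hirmed = 5.8, 10.6, 2.1

print(f"vs. Flat-RAG: {relative_reduction(flat_rag, hirmed):.1%}")  # 63.8%
print(f"vs. TVS:      {relative_reduction(tvs, hirmed):.1%}")       # 80.2%
```

The 3.7-point absolute drop against Flat-RAG therefore corresponds to roughly a 64% relative reduction in missed critical tests.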

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Vector embeddings and similarity search
Basic medical diagnostic workflows (triage/specialty/test)

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

HiRMed: Hierarchical RAG-enhanced Medical Test Recommendation—the proposed system in this paper

Flat-RAG: A baseline RAG system that maps symptoms directly to tests without hierarchical steps

FAISS: Facebook AI Similarity Search—a library for efficient similarity search of dense vectors

TVS: Traditional Vector Similarity—a baseline method using simple cosine similarity between symptom and test embeddings

CRS: Clinical Relevance Score—an expert-assigned score (1-5) evaluating medical appropriateness

Miss Rate: The proportion of tests judged critical in physician review that the system did not recommend

Coverage Rate: The proportion of relevant diagnostic tests included in the system's recommendations
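
The Coverage Rate and Miss Rate definitions above can be read as set ratios for a single visit. The sketch below is one plausible reading, assuming per-visit sets of test names; the paper's exact formulas and any averaging across visits are not given in this summary, and the example data is made up.

```python
def coverage_rate(recommended: set, relevant: set) -> float:
    """Share of relevant tests that appear in the recommendations."""
    if not relevant:
        return 1.0  # assumed convention: nothing to cover counts as full coverage
    return len(recommended & relevant) / len(relevant)


def miss_rate(recommended: set, critical: set) -> float:
    """Share of critical tests that the recommendations left out."""
    if not critical:
        return 0.0  # assumed convention: nothing critical means nothing missed
    return len(critical - recommended) / len(critical)


# Made-up single-visit example.
recommended = {"ECG", "troponin", "lipid panel"}
relevant = {"ECG", "troponin", "echocardiogram", "lipid panel"}
critical = {"ECG", "troponin"}

print(coverage_rate(recommended, relevant))  # 0.75
print(miss_rate(recommended, critical))      # 0.0
```

Under this reading the two metrics are not complements of each other, since they are computed over different reference sets (all relevant tests versus only the critical ones).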