End-to-end agentic RAG system training for traceable diagnostic reasoning

📝 Paper Summary

Agentic RAG pipeline, Medical diagnosis

Deep-DxSearch is an agentic RAG system trained end-to-end via reinforcement learning to perform iterative, evidence-based diagnostic reasoning using a massive medical corpus.

Core Problem

Standard medical RAG systems rely on static, heuristic-driven retrieval that fails to capture the iterative, hypothetico-deductive reasoning of experts, while general LLMs suffer from hallucinations and lack of evidence provenance.

Why it matters:

Clinicians require 'traceable' reasoning anchored in guidelines and precedents, not just opaque predictions
Static 'one-shot' retrieval fails when initial evidence is ambiguous or conflicting, because it cannot actively refine its search queries
General LLMs often hallucinate in high-stakes medical settings and cannot reliably handle the 'long-tail' of rare diseases

Concrete Example: A physician encountering a patient with atypical Lupus symptoms might iteratively check guidelines for skin rashes, then search for historical cases with similar renal issues. A standard RAG model retrieves documents once based on the initial query, often missing the specific nuance needed to differentiate Lupus from mimics.

Key Novelty

Deep-DxSearch: End-to-End Agentic RL for Diagnosis

Models the LLM as an autonomous agent that interacts with a medical environment (guidelines, patient records, literature) via defined actions like <lookup>, <match>, and <search>
Optimizes a single policy using reinforcement learning with a composite reward function that balances diagnostic accuracy, evidence validity, and trajectory diversity
Introduces a <match> action specifically for Case-Based Reasoning (CBR), allowing the agent to retrieve and compare against a database of 150k+ historical patient records

Evaluation Highlights

Improves average accuracy across benchmarks by 22.7% over the second-best model (MedRAG), and also surpasses GPT-4o and DeepSeek-R1
Elevates physicians' average diagnostic accuracy from 45.6% to 69.1% in a human-in-the-loop study involving 150 real-world cases
Achieves 52.7% Top-1 accuracy on the out-of-distribution Mendeley benchmark, outperforming the training-free RAG baseline (MedRAG, 46.9%) by 5.8 percentage points

Breakthrough Assessment

9/10

Significant advancement in medical AI by successfully applying end-to-end RL to agentic RAG. The move from static retrieval to active, policy-driven investigation with verifiable evidence trails addresses key adoption barriers in healthcare.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making process for clinical diagnosis

Inputs: Free-text clinical presentation (patient symptoms and history)

Outputs: Final diagnostic assessment with a traceable chain of evidence

Pipeline Flow

Agent receives Clinical Case
Agent iterates: Generate Thought -> Choose Action (<lookup>, <match>, <search>) -> Receive Observation from Environment
Agent outputs Final Diagnosis <diagnose>
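
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the closing-tag syntax (`<lookup>...</lookup>`), the `run_episode` and `parse_action` helpers, the step budget, and the scripted stand-in policy and tools are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed tag syntax: the summary names the actions but not their exact format.
ACTION_PATTERN = re.compile(r"<(lookup|match|search|diagnose)>(.*?)</\1>", re.DOTALL)


def parse_action(text: str) -> Tuple[str, str]:
    """Return (action, argument) for the first recognised tag in the model output."""
    found = ACTION_PATTERN.search(text)
    if found is None:
        raise ValueError("no action tag found in model output")
    return found.group(1), found.group(2).strip()


def run_episode(case: str,
                policy: Callable[[str], str],
                tools: Dict[str, Callable[[str], str]],
                max_steps: int = 8) -> Tuple[str, List[Tuple[str, str, str]]]:
    """Iterate thought -> action -> observation until <diagnose> or the step budget ends."""
    context = f"Clinical case: {case}\n"
    trace: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        output = policy(context)               # thought plus one action tag
        action, argument = parse_action(output)
        if action == "diagnose":
            return argument, trace             # final answer plus the evidence trail
        observation = tools[action](argument)  # query the environment
        trace.append((action, argument, observation))
        context += f"{output}\nObservation: {observation}\n"
    return "undetermined", trace


# Hypothetical stand-ins for the trained policy and the retrieval environment.
SCRIPT = [
    "<lookup>malar rash</lookup>",
    "<match>rash with renal involvement</match>",
    "<diagnose>systemic lupus erythematosus</diagnose>",
]


def scripted_policy(context: str) -> str:
    step = context.count("Observation:")
    return f"Thought: step {step}. {SCRIPT[step]}"


demo_tools = {
    "lookup": lambda q: f"guideline entry for '{q}'",
    "match": lambda q: f"similar historical cases for '{q}'",
    "search": lambda q: f"literature hits for '{q}'",
}

diagnosis, trace = run_episode("atypical rash, proteinuria", scripted_policy, demo_tools)
print(diagnosis)
print(len(trace), "retrieval steps recorded")
```

The returned `trace` is what makes the output traceable in this sketch: every retrieval call and its observation is kept alongside the final diagnosis.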

System Modules

Reasoning Agent

Orchestrates the diagnostic process by generating thoughts and selecting actions

Model or implementation: MedGemma-27B (or Qwen/Llama variants)

Disease Guideline Retriever (Environment / Retrieval)

Retrieves structured disease profiles and diagnostic criteria

Model or implementation: Keyword/Semantic Search over structured database

Patient Record Matcher (Environment / Retrieval)

Retrieves similar historical patient cases (Case-Based Reasoning)

Model or implementation: Similarity search over patient database
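
As a rough illustration of what a similarity search over patient records could look like, the sketch below ranks records by bag-of-words cosine similarity. The summary does not say which encoder or index the system uses, so the scoring function, the `match_cases` name, and the toy records are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lower-cased whitespace tokens with their counts (a stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_cases(query: str, records: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k record ids most similar to the query, best first."""
    q = bag_of_words(query)
    scored = [(rid, cosine(q, bag_of_words(text))) for rid, text in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy records, invented for illustration only.
records = {
    "case-1": "malar rash joint pain proteinuria",
    "case-2": "chest pain radiating to left arm",
    "case-3": "facial rash and fatigue",
}
print(match_cases("malar rash with proteinuria", records))
```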

Literature Searcher (Environment / Retrieval)

Retrieves biomedical literature for semantic context

Model or implementation: Search engine over 27M+ documents (PubMed, Wiki)

Novel Architectural Elements

Integration of a dedicated <match> primitive for Case-Based Reasoning alongside standard document retrieval
Tripartite environment design (Guidelines, Patient Records, Literature) specifically structured for medical inquiry
End-to-end RL optimization of the retrieval interaction policy rather than just the generation or retrieval modules in isolation

Modeling

Base Model: MedGemma-27B (primary), also evaluated with Qwen2.5-7B/14B, Llama3.1-8B, BaichuanM2

Training Method: Reinforcement Learning (Policy Optimization)

Objective Functions:

Purpose: Encourage accurate final diagnoses.
Formally: Reward based on Top-N accuracy of the <diagnose> output.

Purpose: Encourage valid evidence retrieval (Retrieve-Reason).
Formally: Reward for uncovering high-fidelity evidence (e.g., a matching diagnosis in retrieved cases) that explicitly supports the conclusion.

Purpose: Encourage diverse exploration.
Formally: Trajectory exploration reward to prevent policy collapse into rigid patterns.
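
One way to picture how the three terms could combine is a weighted sum. The summary gives no formula, so everything below is an assumption for illustration: the weights, the way each term is scored, and all function names.

```python
from typing import Sequence, Tuple

RETRIEVAL_ACTIONS = ("lookup", "match", "search")


def accuracy_reward(ranked_diagnoses: Sequence[str], truth: str, top_n: int = 5) -> float:
    """1.0 if the true diagnosis is among the first top_n predictions, else 0.0."""
    top = [d.lower() for d in ranked_diagnoses[:top_n]]
    return 1.0 if truth.lower() in top else 0.0


def evidence_reward(retrieved_case_diagnoses: Sequence[str], truth: str) -> float:
    """Fraction of retrieved cases whose diagnosis matches the truth (0.0 if none)."""
    if not retrieved_case_diagnoses:
        return 0.0
    matches = sum(d.lower() == truth.lower() for d in retrieved_case_diagnoses)
    return matches / len(retrieved_case_diagnoses)


def diversity_reward(actions: Sequence[str]) -> float:
    """Share of the three retrieval action types used at least once in the trajectory."""
    used = {a for a in actions if a in RETRIEVAL_ACTIONS}
    return len(used) / len(RETRIEVAL_ACTIONS)


def composite_reward(ranked_diagnoses: Sequence[str],
                     retrieved_case_diagnoses: Sequence[str],
                     actions: Sequence[str],
                     truth: str,
                     weights: Tuple[float, float, float] = (1.0, 0.5, 0.2)) -> float:
    """Weighted sum of the three terms; the default weights are placeholders."""
    w_acc, w_evi, w_div = weights
    return (w_acc * accuracy_reward(ranked_diagnoses, truth)
            + w_evi * evidence_reward(retrieved_case_diagnoses, truth)
            + w_div * diversity_reward(actions))


reward = composite_reward(
    ranked_diagnoses=["SLE", "dermatomyositis"],
    retrieved_case_diagnoses=["SLE", "psoriasis"],
    actions=["lookup", "match"],
    truth="sle",
)
print(round(reward, 4))
```

A scalar like this would be the per-trajectory signal passed to a policy-optimization algorithm. The summary does not specify which algorithm is used.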

Adaptation: Full model fine-tuning (inferred from context of 'End-to-End Agentic RAG System Training')

Training Data:

Multi-center cohort of 24k+ clinical cases
Sources: MIMIC-IV, PMC-Patients, MedDialog, RareArena, RareBench
Split: 3:1 ratio for ID training/test
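
A 3:1 split puts three quarters of the cases in training and one quarter in test. The sketch below assumes a seeded random shuffle and exactly 24,000 cases. The summary gives only the ratio and "24k+", so the per-dataset handling and exact counts are not known. `split_three_to_one` is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_three_to_one(cases: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then cut at the 75% mark."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]


# 24,000 placeholder case ids stand in for the real cohort.
train_cases, test_cases = split_three_to_one(range(24_000))
print(len(train_cases), len(test_cases))
```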

Compute: Not reported in the paper

Comparison to Prior Work

vs. MedRAG: Deep-DxSearch uses active, iterative retrieval via RL rather than one-shot static retrieval
vs. Meditron/MedGemma: Adds agentic capabilities and external grounding, reducing hallucination compared to pure parametric models
vs. MAC: Optimizes a single unified policy end-to-end rather than coordinating multiple distinct role-playing agents
vs. DoctorAgent: Incorporates a specialized 'match' action for case-based reasoning and utilizes a significantly larger, multi-source retrieval environment (27M+ docs vs limited sets)

Limitations

Database knowledge gaps still contribute to errors (e.g., failure to retrieve key features)
Time efficiency is lower than direct-answer models due to the deliberative retrieval process (>30s vs desired <20s)
Current modality is text-only; does not handle radiology or pathology images
Premature diagnostic closure still occurs, particularly in rare diseases

Reproducibility

Code: https://qiaoyu-zheng.github.io/Deep-DxSearch

Code, data, and checkpoints are available at https://qiaoyu-zheng.github.io/Deep-DxSearch. The paper details the sources of the 16k guidelines, 150k patient records, and 27M literature documents.

📊 Experiments & Results

Evaluation Setup

Diagnostic accuracy and reasoning quality assessment across multi-center datasets

Benchmarks:

MIMIC-IV (Common/Rare) (Clinical Diagnosis)
PMC-Patients (Clinical Diagnosis)
MedDialog (Clinical Diagnosis)
RareArena / RareBench (Rare Disease Diagnosis)
Mendeley / Xinhua-Rare (Out-of-Distribution (OOD) Diagnosis)

Metrics:

Top-1 Accuracy
Top-5 Accuracy
Hit@N (Retrieval)
Hint Score
Action Steps
Statistical methodology: p-values reported (p < 0.01, p < 0.05)
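
Top-k accuracy and Hit@N share the same core check: does the target appear within the first few ranked items? The sketch below uses exact string matching, which is an assumption. The paper may normalise or map diagnosis names before comparing, and the function names are illustrative.

```python
from typing import List, Sequence


def hit_at_n(ranked_items: Sequence[str], target: str, n: int) -> bool:
    """True if the target appears among the first n ranked items."""
    return target in ranked_items[:n]


def top_k_accuracy(predictions: List[List[str]], truths: List[str], k: int) -> float:
    """Fraction of cases whose true diagnosis is within the top k predictions."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    hits = sum(hit_at_n(ranked, truth, k) for ranked, truth in zip(predictions, truths))
    return hits / len(truths)


# Toy ranked predictions for three cases, invented for illustration.
predictions = [["lupus", "psoriasis"], ["gout", "sepsis"], ["asthma", "copd"]]
truths = ["lupus", "sepsis", "pneumonia"]
print(top_k_accuracy(predictions, truths, k=1))
print(top_k_accuracy(predictions, truths, k=2))
```

The same `hit_at_n` check applies to retrieval: pass the ranked retrieved items instead of ranked diagnoses.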

Key Results

Deep-DxSearch consistently outperforms general and medical baselines on both In-Distribution (ID) and Out-of-Distribution (OOD) datasets.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
In-Distribution Average (Common)	Top-1 Accuracy	23.0	42.2	+19.2
In-Distribution Average (Rare)	Top-1 Accuracy	34.0	52.5	+18.5
Mendeley (Common)	Top-1 Accuracy	46.9	52.7	+5.8
Xinhua-Rare	Top-1 Accuracy	45.1	46.3	+1.2
Ablation studies demonstrate the critical role of the RL policy and specific reward components.
MedDialog	Top-1 Accuracy	9.0	49.3	+40.3
Average	Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper
Human evaluation confirms clinical utility.
Real-world cases (N=150)	Diagnostic Accuracy	45.6	69.1	+23.5

Main Takeaways

Agentic RL training outperforms static Prompt Engineering, RAG, and SFT, especially on OOD data where SFT tends to overfit.
Case-based reasoning (via <match>) is the most critical retrieval component; removing it causes the largest performance drop.
Physicians prefer the 'glass box' transparency of Deep-DxSearch (Score 4.2/5) over the opaque reasoning of DeepSeek-R1 (3.3/5), even though it is slower.
The system learns to 'dig deeper'—average trajectory length increases to >5.5 steps with RL, compared to ~3 steps for baselines.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (Policy Optimization)
Retrieval-Augmented Generation (RAG)
Medical diagnostic workflows (Differential diagnosis, Evidence-Based Medicine)

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

Agentic RAG: RAG systems where the model acts as an agent, autonomously deciding when and what to retrieve in multiple steps rather than a single fixed retrieval step

Hypothetico-deductive reasoning: The scientific method used in clinical diagnosis: formulating hypotheses based on symptoms and testing them against evidence

SFT: Supervised Fine-Tuning—training a model on labeled examples

RL: Reinforcement Learning—training an agent to maximize a reward signal through trial and error

Case-Based Reasoning (CBR): Solving new problems based on the solutions of similar past problems (implemented here via the <match> action)

OOD: Out-of-Distribution—evaluating on data from sources or distributions not seen during training

Hit@N: A metric measuring if the correct information (e.g., diagnosis) appears in the top N retrieved items

PPA: Positive Percent Agreement—measure of consensus between evaluators