
End-to-end agentic RAG system training for traceable diagnostic reasoning

Q Zheng, Y Sun, C Wu, W Zhao, P Qiu, Y Yu…
Shanghai Jiao Tong University, Shanghai, China; Shanghai AI Laboratory, Shanghai, China; Xinhua Hospital affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
arXiv, August 2025
RAG Agent RL Factuality Reasoning Benchmark

📝 Paper Summary

Agentic RAG pipeline, Medical diagnosis
Deep-DxSearch is an agentic RAG system trained end-to-end via reinforcement learning to perform iterative, evidence-based diagnostic reasoning using a massive medical corpus.
Core Problem
Standard medical RAG systems rely on static, heuristic-driven retrieval that fails to capture the iterative, hypothetico-deductive reasoning of experts, while general LLMs hallucinate and cannot show where their evidence comes from.
Why it matters:
  • Clinicians require 'traceable' reasoning anchored in guidelines and precedents, not just opaque predictions
  • Static 'one-shot' retrieval fails when initial evidence is ambiguous or conflicting, because it cannot actively refine its search queries
  • General LLMs often hallucinate in high-stakes medical settings and cannot reliably handle the 'long-tail' of rare diseases
Concrete Example: A physician encountering a patient with atypical lupus symptoms might first check guidelines for skin rashes, then search for historical cases with similar renal issues. A standard RAG model retrieves documents once, based on the initial query, and often misses the specific details needed to differentiate lupus from its mimics.
Key Novelty
Deep-DxSearch: End-to-End Agentic RL for Diagnosis
  • Models the LLM as an autonomous agent that interacts with a medical environment (guidelines, patient records, literature) via defined actions like <lookup>, <match>, and <search>
  • Optimizes a single policy using reinforcement learning with a composite reward function that balances diagnostic accuracy, evidence validity, and trajectory diversity
  • Introduces a <match> action specifically for Case-Based Reasoning (CBR), allowing the agent to retrieve and compare against a database of 150k+ historical patient records
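
The action interface and composite reward described above can be illustrated with a minimal Python sketch. Only the three action names and the three reward components (diagnostic accuracy, evidence validity, trajectory diversity) come from the summary. The closing-tag syntax, the weights, the scoring rules, and every function name here are assumptions made for illustration, not the authors' implementation.

```python
import re
from dataclasses import dataclass

# The three action names come from the summary; the closing-tag syntax is an assumption.
ACTION_RE = re.compile(r"<(lookup|match|search)>(.*?)</\1>", re.DOTALL)


def parse_actions(model_output: str) -> list[tuple[str, str]]:
    """Return (action, argument) pairs in the order they appear in the output."""
    return [(m.group(1), m.group(2).strip()) for m in ACTION_RE.finditer(model_output)]


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the summary does not state how the terms are balanced.
    accuracy: float = 0.6
    evidence: float = 0.3
    diversity: float = 0.1


def composite_reward(predicted, truth, cited_ids, valid_ids, actions, weights=RewardWeights()):
    """Weighted sum of three illustrative reward terms, each in [0, 1]."""
    # Accuracy: exact match on the normalized diagnosis string.
    accuracy = 1.0 if predicted.strip().lower() == truth.strip().lower() else 0.0
    # Evidence validity: share of cited items that exist in the retrieved evidence set.
    evidence = sum(c in valid_ids for c in cited_ids) / len(cited_ids) if cited_ids else 0.0
    # Diversity: share of the three action types used at least once in the trajectory.
    diversity = len({name for name, _ in actions}) / 3
    return weights.accuracy * accuracy + weights.evidence * evidence + weights.diversity * diversity


trajectory = "<lookup>malar rash</lookup> reasoning text <match>rash with renal involvement</match>"
actions = parse_actions(trajectory)
reward = composite_reward("Lupus", "lupus", ["g1", "c2"], {"g1"}, actions)
```

In this toy trajectory the agent issues a guideline lookup and a case match, then cites two evidence items of which one is valid. A scalar reward of this shape is what a policy-gradient method would optimize. The actual reward terms used in the paper may be defined quite differently.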
Evaluation Highlights
  • Improves accuracy by 22.7% on average across benchmarks over the second-best model (MedRAG), and also surpasses GPT-4o and DeepSeek-R1
  • Elevates physicians' average diagnostic accuracy from 45.6% to 69.1% in a human-in-the-loop study involving 150 real-world cases
  • Achieves 52.7% Top-1 accuracy on the out-of-distribution Mendeley benchmark, outperforming the training-free RAG baseline (MedRAG) by 5.8%
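
The highlights above report gains as percentages without saying whether they are absolute (percentage points) or relative. As a worked example, the short sketch below computes both readings for the one result that gives both endpoints, the physician study (45.6% to 69.1%). The helper name `gain` is hypothetical.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return points, relative


points, relative = gain(45.6, 69.1)
print(f"{points:.1f} points, {relative:.1f}% relative")
# prints "23.5 points, 51.5% relative"
```

The two readings differ by more than a factor of two here, so it is worth checking which one a reported figure uses before comparing results.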
Breakthrough Assessment
9/10
Significant advancement in medical AI by successfully applying end-to-end RL to agentic RAG. The move from static retrieval to active, policy-driven investigation with verifiable evidence trails addresses key adoption barriers in healthcare.