Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu, Shaoting Zhang, Xiaofan Zhang, Kun Qian, Xiaomu Li
Shanghai Jiao Tong University; Department of Endocrinology, The First Affiliated Hospital, Zhejiang University School of Medicine; Department of Endocrinology, Zhongshan Hospital, Fudan University; SenseTime; Multimedia Laboratory, The Chinese University of Hong Kong
arXiv (2026)
Agent RAG Reasoning Benchmark

📝 Paper Summary

Medical reasoning agents · Human-AI collaboration
PULSE is a medical reasoning agent combining large language models with scientific literature retrieval that matches senior specialist accuracy in complex endocrinology cases and stabilizes diagnostic performance across rare diseases.
Core Problem
Diagnostic errors are common in complex medical fields like endocrinology because atypical or rare diseases appear infrequently, preventing physicians from building recognition patterns.
Why it matters:
  • Patients with rare or multisystem diseases often suffer prolonged diagnostic journeys due to nonspecific early symptoms.
  • Diagnostic performance varies widely with physician experience; trainees are prone to premature closure (settling on a diagnosis before alternatives have been adequately considered).
  • Existing AI evaluations often focus on simplified vignettes rather than complex real-world cases with longitudinal data.
Concrete Example: In ultra-rare endocrinology cases (<0.001% incidence), junior specialists tend to fixate on common symptoms and miss the diagnosis (their accuracy falls to 25.6%), whereas the AI agent maintains stable performance by retrieving relevant literature regardless of rarity.
Key Novelty
Evidence-Integrated Reasoning Agent (PULSE)
  • Combines a reasoning-oriented LLM with a scientific literature retrieval engine to ground diagnoses in up-to-date medical evidence.
  • Exhibits 'adaptive thinking' by increasing output length (reasoning intensity) for harder cases, mimicking expert human deliberation.
  • Evaluated via distinct collaboration workflows: 'Serial' (post-hoc review) vs. 'Concurrent' (real-time co-pilot), showing different impacts on physician autonomy.
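The retrieve-then-reason pattern described above can be illustrated with a minimal sketch. This is not PULSE's implementation: the `Passage` type, the word-overlap `retrieve` function, the prompt wording, and the `llm` callable are all hypothetical stand-ins chosen so the example runs without external services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source: str  # hypothetical identifier for a literature snippet
    text: str


def retrieve(query: str, corpus: List[Passage], top_n: int = 2) -> List[Passage]:
    """Toy retriever: rank passages by how many query words they share.

    A real system would call a literature search engine instead.
    """
    terms = set(query.lower().split())
    scored = [(len(terms & set(p.text.lower().split())), p) for p in corpus]
    scored = [sp for sp in scored if sp[0] > 0]          # drop non-matching passages
    scored.sort(key=lambda sp: sp[0], reverse=True)      # highest overlap first
    return [p for _, p in scored[:top_n]]


def build_prompt(case_summary: str, evidence: List[Passage]) -> str:
    """Combine the case and numbered evidence into a single prompt string."""
    cited = "\n".join(
        f"[{i + 1}] ({p.source}) {p.text}" for i, p in enumerate(evidence)
    )
    return (
        f"Case:\n{case_summary}\n\n"
        f"Evidence:\n{cited}\n\n"
        "List ranked differential diagnoses, citing evidence by number."
    )


def diagnose(case_summary: str, corpus: List[Passage],
             llm: Callable[[str], str]) -> str:
    """Retrieve evidence first, then hand case plus evidence to the model."""
    evidence = retrieve(case_summary, corpus)
    return llm(build_prompt(case_summary, evidence))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM API; in practice `llm` would wrap whichever reasoning model is in use.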
Evaluation Highlights
  • PULSE achieved 57.32% Top@1 accuracy, significantly outperforming residents (23.41%) and junior specialists (34.63%) and showing no statistically significant difference from senior specialists (65.85%, p=0.25).
  • In ultra-rare disease cases (<0.001% incidence), PULSE maintained stable accuracy, whereas junior specialists' performance dropped significantly to 25.6%.
  • Concurrent AI assistance improved residents' Top@1 accuracy from ~23% to 48.8%–62.2%, effectively closing the gap with unassisted specialists.
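For readers unfamiliar with the Top@1 metric quoted above, the following is a minimal sketch of how Top@k accuracy is commonly computed: a case counts as correct if the reference diagnosis appears among the first k ranked candidates. The function name, the case-insensitive exact string match, and the toy data are assumptions for illustration; the paper's own scoring procedure may differ (for example, it may rely on clinician adjudication rather than string matching).

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   gold: Sequence[str], k: int = 1) -> float:
    """Fraction of cases whose gold diagnosis is within the top-k candidates."""
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        raise ValueError("at least one case is required")
    hits = 0
    for preds, truth in zip(ranked_predictions, gold):
        # Assumption: a case-insensitive exact match defines a correct diagnosis.
        if truth.lower() in [p.lower() for p in preds[:k]]:
            hits += 1
    return hits / len(gold)


# Toy data, invented purely for illustration (not from the paper).
toy_predictions = [
    ["Cushing syndrome", "obesity"],
    ["type 2 diabetes", "insulinoma"],
    ["Graves disease"],
    ["hypothyroidism", "Addison disease"],
]
toy_gold = ["Cushing syndrome", "insulinoma", "Graves disease", "Addison disease"]

print(top_k_accuracy(toy_predictions, toy_gold, k=1))  # 0.5
print(top_k_accuracy(toy_predictions, toy_gold, k=2))  # 1.0
```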
Breakthrough Assessment
8/10
Strong empirical demonstration of AI matching senior specialists in complex real-world cases and effectively closing the experience gap for trainees. The rigorous comparison of serial vs. concurrent workflows provides valuable HCI insights.