Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

📝 Paper Summary

Medical reasoning agents, Human-AI collaboration

PULSE is a medical reasoning agent combining large language models with scientific literature retrieval that matches senior specialist accuracy in complex endocrinology cases and stabilizes diagnostic performance across rare diseases.

Core Problem

Diagnostic errors are common in complex medical fields like endocrinology because atypical or rare diseases appear infrequently, preventing physicians from building recognition patterns.

Why it matters:

Patients with rare or multisystem diseases often suffer prolonged diagnostic journeys due to nonspecific early symptoms.
Diagnostic performance varies widely with physician experience; trainees are especially prone to premature closure (settling on a diagnosis before alternatives have been adequately considered).
Existing AI evaluations often focus on simplified vignettes rather than complex real-world cases with longitudinal data.

Concrete Example: In an ultra-rare endocrinology case (<0.001% incidence), a junior specialist might fixate on common symptoms and miss the diagnosis (junior specialists' accuracy fell to 25.6% on such cases), whereas the AI agent maintains stable performance by retrieving relevant literature regardless of rarity.

Key Novelty

Evidence-Integrated Reasoning Agent (PULSE)

Combines a reasoning-oriented LLM with a scientific literature retrieval engine to ground diagnoses in up-to-date medical evidence.
Exhibits 'adaptive thinking' by increasing output length (reasoning intensity) for harder cases, mimicking expert human deliberation.
Evaluated via distinct collaboration workflows: 'Serial' (post-hoc review) vs. 'Concurrent' (real-time co-pilot), showing different impacts on physician autonomy.

Evaluation Highlights

PULSE achieved 57.32% Top@1 accuracy, significantly outperforming residents (23.41%) and junior specialists (34.63%) while matching senior specialists (65.85%, p=0.25).
In ultra-rare disease cases (<0.001% incidence), PULSE maintained stable accuracy, whereas junior specialists' performance dropped significantly to 25.6%.
Concurrent AI assistance improved residents' Top@1 accuracy from ~23% to 48.8%–62.2%, effectively closing the gap with unassisted specialists.

Breakthrough Assessment

8/10

Strong empirical demonstration of AI matching senior specialists in complex real-world cases and effectively closing the experience gap for trainees. The rigorous comparison of serial vs. concurrent workflows provides valuable HCI insights.

⚙️ Technical Details

Problem Definition

Setting: Multi-choice clinical diagnosis based on complex, unstructured case reports.

Inputs: Detailed clinical presentations, laboratory findings, imaging data, and pathological information.

Outputs: Ranked list of differential diagnoses (Primary + 3 alternatives).

Pipeline Flow

Literature Retrieval → Reasoning & Generation

System Modules

Literature Retrieval Engine

Search scientific literature to find evidence relevant to the clinical case

Model or implementation: Not explicitly specified in paper text (generic retrieval implied)

Reasoning Agent (PULSE)

Synthesize case data and retrieved evidence to generate a differential diagnosis

Model or implementation: Domain-tuned large language model (specific architecture not detailed in text)
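
The two-stage flow above (Literature Retrieval → Reasoning & Generation) can be sketched in a few lines. This is only an illustration of the data flow, not the paper's implementation: the paper does not specify the retrieval engine or the model interface, so the token-overlap scorer, the `Abstract` type, the prompt wording, the three-entry corpus, and the `llm` callable are all hypothetical stand-ins.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Abstract:
    """Hypothetical container for one retrieved literature item."""
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real retrieval model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(case_text: str, corpus: list[Abstract], k: int = 2) -> list[Abstract]:
    """Stage 1: rank abstracts by token overlap with the case (toy scorer)."""
    case_tokens = tokenize(case_text)
    ranked = sorted(
        corpus,
        key=lambda a: len(case_tokens & tokenize(a.title + " " + a.text)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(case_text: str, evidence: list[Abstract]) -> str:
    """Stage 2 input: the case plus retrieved evidence (assumed prompt wording)."""
    evidence_block = "\n".join(f"- {a.title}: {a.text}" for a in evidence)
    return (
        f"Clinical case:\n{case_text}\n\n"
        f"Retrieved evidence:\n{evidence_block}\n\n"
        "Give one primary diagnosis and three alternatives, ranked."
    )


def diagnose(case_text: str, corpus: list[Abstract],
             llm: Callable[[str], str], k: int = 2) -> str:
    """Retrieval followed by generation; `llm` is any text-in, text-out callable."""
    return llm(build_prompt(case_text, retrieve(case_text, corpus, k)))


# Illustrative corpus and case (not taken from the paper).
CORPUS = [
    Abstract("Cushing syndrome",
             "Hypercortisolism with central obesity, purple striae and proximal muscle weakness."),
    Abstract("Pheochromocytoma",
             "Catecholamine-secreting tumor with paroxysmal hypertension, headache and palpitations."),
    Abstract("Primary hyperparathyroidism",
             "Hypercalcemia with kidney stones and bone loss."),
]
CASE = "45-year-old patient with paroxysmal hypertension, episodic headache and palpitations."


def stub_llm(prompt: str) -> str:
    """Placeholder for the reasoning model; returns a fixed string."""
    return "PRIMARY: (model output goes here)"


print(diagnose(CASE, CORPUS, stub_llm))
```

Passing the model in as a callable keeps the sketch independent of any particular LLM API; a real system would replace the overlap scorer with a proper literature search backend.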

Modeling

Base Model: Domain-tuned large language model (PULSE)

Compute: Not reported in the paper

Comparison to Prior Work

vs. AMIE: PULSE focuses on evidence-integrated diagnosis for complex real-world cases rather than simulated dialogue.
vs. OpenAI-o1/DeepSeek-R1: PULSE is specifically domain-tuned and integrates external literature retrieval, whereas general reasoning models may lack specialized medical context or external evidence access.
vs. Physicians: PULSE maintains consistent accuracy across disease rarity tiers, whereas human performance degrades significantly for rare conditions.

Limitations

Senior specialists still outperform the agent in Top@1 accuracy (65.85% vs. 57.32%), though the difference is not statistically significant (p=0.25).
Risk of automation bias: physicians occasionally adopted incorrect AI suggestions even when their initial diagnosis was correct.
Evaluation limited to 82 endocrinology cases; generalizability to other medical specialties is untested.
Concurrent collaboration showed lower AI-physician agreement, indicating heterogeneous usage patterns among trainees.
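
The automation-bias risk noted above can be made concrete by tallying, per case, whether a physician's diagnosis was correct before and after reviewing the AI suggestion in the serial workflow. The sketch below is a minimal illustration under assumed inputs: the function names and the boolean per-case flags are hypothetical, and the example data are invented rather than taken from the paper.

```python
from collections import Counter


def serial_transitions(before_correct, after_correct):
    """Count per-case outcomes before vs. after the physician sees the AI output."""
    labels = {
        (True, True): "stayed_correct",
        (True, False): "correct_to_incorrect",  # candidate automation-bias harm
        (False, True): "incorrect_to_correct",  # benefit from AI review
        (False, False): "stayed_incorrect",
    }
    return Counter(
        labels[(bool(b), bool(a))] for b, a in zip(before_correct, after_correct)
    )


def net_benefit(counts):
    """Cases gained minus cases lost after AI review."""
    return counts["incorrect_to_correct"] - counts["correct_to_incorrect"]


# Invented flags for five cases (not data from the paper).
before = [True, True, False, False, False]
after = [True, False, True, True, False]
counts = serial_transitions(before, after)
print(dict(counts), "net benefit:", net_benefit(counts))
```

A correct-to-incorrect switch is only a candidate for automation bias: confirming it would also require checking that the revised diagnosis matches the AI's incorrect suggestion, which this tally does not do.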

Reproducibility

No replication artifacts mentioned in the paper. The specific base model architecture, training data, and retrieval implementation details are not provided in the text. The 82-case benchmark is described but availability is not explicitly stated.

📊 Experiments & Results

Evaluation Setup

Comparative benchmarking on 82 complex, real-world endocrinology case reports.

Benchmarks:

Endocrinology Case Benchmark (clinical diagnosis: primary + differential) [New]

Metrics:

Top@1 Accuracy (Primary diagnosis matches reference)
Top@4 Accuracy (Reference appears in top 4 diagnoses)
Diagnostic Time (Human) / Output Length (Model)
Number of distinct diagnoses generated (Hypothesis space breadth)
Statistical methodology: Paired McNemar tests with Holm correction for multiple comparisons; Spearman correlation for effort vs. difficulty.
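
The metrics and tests listed above can be sketched as follows. This is a minimal standard-library illustration, not the paper's evaluation code: it assumes diagnoses are matched by exact string equality (the paper uses an automated evaluator for semantic consistency instead), uses the exact binomial form of the McNemar test, and all function names are hypothetical.

```python
from math import comb


def top_k_accuracy(ranked_lists, references, k):
    """Fraction of cases whose reference diagnosis is among the first k ranked diagnoses."""
    hits = sum(1 for ranked, ref in zip(ranked_lists, references) if ref in ranked[:k])
    return hits / len(references)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired per-case correctness flags."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b  # discordant pairs; concordant pairs carry no information
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Invented toy data (not from the paper).
ranked = [["a", "b", "c", "d"], ["x", "y", "z", "w"], ["m", "n", "o", "p"]]
refs = ["a", "z", "q"]
print("Top@1:", top_k_accuracy(ranked, refs, 1))
print("Top@4:", top_k_accuracy(ranked, refs, 4))
print("McNemar p:", mcnemar_exact([1, 1, 1, 1, 0, 0], [0, 0, 0, 1, 1, 0]))
print("Holm:", holm_adjust([0.01, 0.04, 0.03]))
```

McNemar's test is the natural choice here because every rater (human or model) diagnoses the same 82 cases, so correctness flags are paired; Holm's procedure then controls the family-wise error rate across the several pairwise comparisons.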

Key Results

PULSE significantly outperforms residents and junior specialists, achieving parity with senior specialists.

Benchmark	Metric	Baseline	This Paper	Δ (pp)
Endocrinology Case Benchmark	Top@1 Accuracy	23.41 (residents)	57.32	+33.91
Endocrinology Case Benchmark	Top@1 Accuracy	34.63 (junior specialists)	57.32	+22.69
Endocrinology Case Benchmark	Top@1 Accuracy	65.85 (senior specialists)	57.32	-8.53
Endocrinology Case Benchmark	Top@4 Accuracy	82.93	79.27	-3.66

AI assistance consistently improves physician performance, particularly for less experienced groups and in rare diseases.

Benchmark	Metric	Baseline (unassisted)	With AI assistance	Δ (pp)
Endocrinology Case Benchmark	Top@1 Accuracy	34.63 (junior specialists)	47.6	+12.97
Endocrinology Case Benchmark	Top@1 Accuracy	23.41 (residents)	48.8	+25.39

Main Takeaways

PULSE exhibits 'adaptive reasoning': output length correlates with case difficulty (similar to experts taking more time), whereas general models like Qwen3 or DeepSeek-R1 did not show this trend.
Collaboration acts as an equalizer: AI assistance provided the largest gains in ultra-rare disease cases (<0.001% incidence), where human performance was initially lowest.
Workflow matters: 'Serial' collaboration (post-hoc) led to high AI-physician agreement (convergence), while 'Concurrent' collaboration (co-pilot) preserved more physician autonomy and hypothesis diversity.
Hypothesis space: PULSE generated a broader range of distinct disease entities (270) compared to senior specialists (209), suggesting it considers a wider differential.
Automation bias exists: In serial collaboration, physicians sometimes switched from a correct independent diagnosis to an incorrect AI suggestion, though net benefit remained positive.
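
The 'adaptive reasoning' takeaway rests on a Spearman correlation between effort (output length for the model, time for physicians) and case difficulty. A minimal standard-library sketch of that statistic is shown below; the function names are hypothetical, the difficulty scores and token counts are invented for illustration, and how the paper scores difficulty is not stated in this summary.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Invented example: per-case difficulty scores and model output lengths in tokens.
difficulty = [1, 2, 3, 4, 5]
output_tokens = [800, 950, 1400, 1300, 2100]
print("rho:", spearman_rho(difficulty, output_tokens))
```

A rank-based correlation fits this question because only the monotone trend matters (harder cases drawing longer outputs), not whether output length grows linearly with difficulty.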

📚 Prerequisite Knowledge

Prerequisites

Clinical reasoning frameworks (differential diagnosis)
Large Language Models (LLMs) and RAG (Retrieval-Augmented Generation)
Medical evaluation metrics (Top@K accuracy)

Key Terms

Top@1: Accuracy metric checking if the primary (first-ranked) diagnosis matches the gold standard.

Top@4: Accuracy metric checking if the gold standard diagnosis appears anywhere in the top 4 predicted diagnoses.

Differential Diagnosis (DDx): The process of differentiating between two or more conditions which share similar signs or symptoms.

Serial Collaboration: A workflow where the physician diagnoses independently first, then reviews AI suggestions to revise their decision.

Concurrent Collaboration: A workflow where the physician receives AI assistance (co-pilot) simultaneously while reviewing the case material.

Automation Bias: The tendency of humans to over-rely on automated systems, potentially accepting incorrect AI suggestions.

Incidence Tiers: Categorization of diseases based on their frequency in the population (e.g., Common, Rare, Ultra-rare).

Reasoning-oriented LLM: An LLM optimized to generate intermediate reasoning steps (chains of thought) before producing a final answer.

GPT-5.1: An OpenAI large language model, used in the paper as an automated evaluator for semantic consistency.