Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

📝 Paper Summary

Medical LLM Benchmarking, Clinical Reasoning Evaluation, Rare Disease Diagnosis

MedR-Bench introduces a dataset of 1,453 structured patient cases and an automated Reasoning Evaluator to assess LLMs' clinical reasoning processes alongside their final diagnostic outputs.

Core Problem

Existing medical LLM benchmarks primarily evaluate final outputs (e.g., diagnosis accuracy) while neglecting the quality, transparency, and coherence of the reasoning process itself.

Why it matters:

Clinical practice requires constructing logical reasoning chains from incomplete information, not just guessing the final label
Evaluating only final answers cannot show whether the model reached the right conclusion for the right reasons, which matters for safety and reliability
Current benchmarks lack sufficient coverage of the full patient care journey, specifically missing examination recommendation and complex treatment planning

Concrete Example: A model might correctly diagnose 'appendicitis' based on keywords but fail to recommend the necessary confirmatory CT scan or explain *why* it ruled out diverticulitis, making the correct diagnosis brittle and untrustworthy in practice.

Key Novelty

Reasoning-Centric Clinical Evaluation Framework

Deconstructs clinical cases into three stages (examination recommendation, diagnosis, treatment) to simulate the full patient care trajectory rather than just QA
Introduces the 'Reasoning Evaluator', an automated system that cross-references free-text reasoning with web-scale medical resources to score efficiency, factuality, and completeness
Includes a dedicated subset of rare diseases (656 cases) to test robustness on long-tail medical conditions

Architecture

The MedR-Bench evaluation framework illustrating the three stages of the patient care journey simulation.

Evaluation Highlights

DeepSeek-R1 achieves 89.76% diagnostic accuracy in the oracle setting (where all info is provided), outperforming OpenAI-o3-mini (84.53%)
In the Examination Recommendation task (1-turn), precision is low across the board; Baichuan-M1 leads with only 41.78%, while Gemini-2.0-FT drops to 22.77%
Factuality of reasoning steps is generally high (>90% for most models), but Completeness varies widely, with Qwen-QwQ achieving 79.97% completeness in diagnosis due to verbose outputs

Breakthrough Assessment

8/10

Significant step forward by moving beyond accuracy metrics to process-based evaluation in medicine. The automated Reasoning Evaluator addresses a major bottleneck in evaluating free-text clinical reasoning.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of LLMs as clinical agents across three sequential tasks: recommending examinations, diagnosing diseases, and planning treatments.

Inputs: Structured patient case summaries (demographics, complaints, history) with varying levels of hidden ancillary test results.

Outputs: Free-text reasoning rationale followed by structured decisions (list of exams, final diagnosis, or treatment plan).
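
The input and output contract described above can be pictured as a pair of record types. This is a minimal sketch only: the class names, the field names, and the sample case are illustrative assumptions, not the schema or data released with MedR-Bench.

```python
from dataclasses import dataclass, field


@dataclass
class PatientCase:
    """Hypothetical container for one structured case summary."""
    demographics: str
    complaints: str
    history: str
    # Ancillary test results the model cannot see until it asks for them.
    hidden_results: dict[str, str] = field(default_factory=dict)

    def visible_summary(self) -> str:
        """Text shown to the model; hidden results are deliberately excluded."""
        return "\n".join([self.demographics, self.complaints, self.history])


@dataclass
class ModelResponse:
    """Hypothetical container for a model's answer on one task."""
    reasoning: str        # free-text rationale
    decisions: list[str]  # exams, a final diagnosis, or treatment steps


# Made-up example case, for illustration only.
case = PatientCase(
    demographics="54-year-old male",
    complaints="Right lower quadrant pain for 12 hours",
    history="No prior abdominal surgery",
    hidden_results={"abdominal ct": "dilated appendix with wall thickening"},
)
response = ModelResponse(
    reasoning="Localized pain suggests appendicitis; imaging should confirm.",
    decisions=["abdominal ct"],
)
print(case.visible_summary())
```

Keeping the hidden results in a separate field makes it harder to leak them into the prompt by accident, which is the property the varying-information settings depend on.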

Pipeline Flow

Data Construction (Extract/Structure PMC cases)
Task Simulation (Agent Interaction)
Automated Evaluation (Reasoning Evaluator)

System Modules

Case Structurer

Converts raw PMC case reports into structured JSONs (Summary, Reasoning, Diagnosis/Treatment) using GPT-4o

Model or implementation: GPT-4o

Patient Agent

Simulates the patient during examination recommendation, holding ground-truth test results and revealing them when queried

Model or implementation: LLM-powered agent (Model unspecified, likely GPT-4o)
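
To make the patient agent's role concrete, here is a toy sketch of the query-and-reveal interaction. It assumes a plain dictionary lookup in place of the LLM-powered agent; `PatientAgent`, `run_free_turn`, the `propose_exams` callback, and the `max_turns` cap are all hypothetical names introduced for illustration. The early exit on repeated requests is an added guard, not something the benchmark is known to implement.

```python
class PatientAgent:
    """Toy stand-in for the LLM-powered patient agent (keyword lookup only)."""

    def __init__(self, test_results: dict[str, str]):
        # Normalize keys once so lookups are case-insensitive.
        self._results = {k.strip().lower(): v for k, v in test_results.items()}

    def query(self, exam: str) -> str:
        """Reveal the stored result for an exam, or say none is on record."""
        return self._results.get(exam.strip().lower(), "No result available.")


def run_free_turn(agent, propose_exams, max_turns=5):
    """Query iteratively until the model stops asking or only repeats itself.

    `propose_exams` is a hypothetical callback: it receives the results
    gathered so far and returns the next list of exams to request.
    Setting max_turns=1 mimics the 1-turn setting.
    """
    gathered: dict[str, str] = {}
    for _ in range(max_turns):
        requested = [e.strip().lower() for e in propose_exams(gathered)]
        new = [e for e in requested if e not in gathered]
        if not new:  # nothing requested, or only repeats: stop the loop
            break
        for exam in new:
            gathered[exam] = agent.query(exam)
    return gathered


# Made-up results and a scripted "model" that eventually repeats a request.
agent = PatientAgent({"Abdominal CT": "dilated appendix", "CBC": "WBC 15.2"})


def propose(gathered):
    if not gathered:
        return ["abdominal ct"]
    if "cbc" not in gathered:
        return ["CBC"]
    return ["abdominal ct"]  # repeat; the loop exits here


results = run_free_turn(agent, propose)
print(results)
```

The turn cap and the repeat check both bound the interaction, which is one simple way to keep a model that loops on the same query from running indefinitely.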

Reasoning Evaluator

Decomposes free-text reasoning into steps and verifies them against external knowledge

Model or implementation: LLM-based verifier with web access
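
Once reasoning has been split into steps and each step labeled, the three reasoning scores reduce to simple ratios. The sketch below assumes one plausible set of definitions, which this summary does not spell out: efficiency as the share of steps that add something new, factuality as the share of those effective steps that are correct, and completeness as the share of reference reasoning steps the model covered. The `Step` type, the labels, and the sample steps are hypothetical; in the real system the labels would come from the LLM-based verifier.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step after decomposition (labels assumed given)."""
    text: str
    effective: bool  # adds new, relevant insight rather than restating
    correct: bool    # judged consistent with external medical evidence


def reasoning_scores(steps, reference_steps, covered):
    """Return (efficiency, factuality, completeness) as fractions in [0, 1].

    `covered` is the set of reference steps judged to be present
    in the model's reasoning.
    """
    effective = [s for s in steps if s.effective]
    efficiency = len(effective) / len(steps) if steps else 0.0
    factuality = (
        sum(s.correct for s in effective) / len(effective) if effective else 0.0
    )
    completeness = (
        len(covered & set(reference_steps)) / len(reference_steps)
        if reference_steps
        else 0.0
    )
    return efficiency, factuality, completeness


# Made-up labeled steps for illustration.
steps = [
    Step("RLQ pain with fever points to appendicitis", True, True),
    Step("The patient has RLQ pain", False, True),  # restatement
    Step("A normal white cell count excludes appendicitis", True, False),
    Step("CT can confirm appendiceal inflammation", True, True),
]
reference = ["appendicitis suspected", "confirm with CT", "exclude diverticulitis"]
covered = {"appendicitis suspected", "confirm with CT"}

eff, fact, comp = reasoning_scores(steps, reference, covered)
print(f"efficiency={eff:.2f} factuality={fact:.2f} completeness={comp:.2f}")
```

Under these assumed definitions, a verbose model can raise completeness by covering more reference steps while lowering efficiency through restatements, so the three scores are best read together.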

Novel Architectural Elements

Three-stage clinical workflow simulation (Examination -> Diagnosis -> Treatment) rather than isolated QA tasks
Dynamic cross-referencing mechanism in the Reasoning Evaluator to validate specific reasoning steps against online medical evidence

Comparison to Prior Work

vs. MedQA: MedR-Bench evaluates free-text reasoning and multi-step decision making (exams -> diagnosis), not just final choice selection
vs. AgentClinic: MedR-Bench uses structured cases from peer-reviewed PMC reports rather than synthetic or simplified vignettes
vs. MedAlign [not cited in paper]: MedAlign focuses on alignment with clinician instructions; MedR-Bench focuses on the diagnostic accuracy and reasoning correctness against ground truth case reports
Novelty: First benchmark to explicitly quantify 'reasoning efficiency' and 'reasoning completeness' alongside diagnostic accuracy in a clinical pipeline

Limitations

Evaluation relies on an LLM-based system (Reasoning Evaluator), which may propagate its own biases or errors
In the free-turn setting, models got stuck in repetitive query loops, indicating limited ability to handle long-context dynamic interactions
Recall for examination recommendation is generally low (<45%), suggesting that models miss many of the tests recorded in the case report even though those results are available on request

Reproducibility

Code: https://github.com/MedR-Bench/MedR-Bench

The code is publicly available. The benchmark data (1,453 cases), evaluation code, and model responses are released. Open-source models (DeepSeek, Qwen, Baichuan) were run locally; closed-source models (OpenAI, Gemini) were accessed via API.

📊 Experiments & Results

Evaluation Setup

Simulation of clinical tasks using structured patient cases from PMC (post-July 2024 to avoid data contamination)

Benchmarks:

MedR-Bench-Diagnosis: diagnostic decision-making and examination recommendation, 957 cases [New]
MedR-Bench-Treatment: treatment planning, 496 cases [New]

Metrics:

Accuracy (Final Diagnosis/Treatment)
Precision & Recall (Examination Recommendation)
Efficiency (Reasoning)
Factuality (Reasoning)
Completeness (Reasoning)
Statistical methodology: Not explicitly reported in the paper
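
Precision and recall for examination recommendation can be illustrated with plain set overlap. This sketch assumes exact matching after lowercasing and trimming; the benchmark may match recommended and recorded exams more flexibly, and the function name and sample exam lists are made up for illustration.

```python
def exam_precision_recall(recommended, ground_truth):
    """Set-overlap precision and recall, matching on normalized exam names."""
    rec = {e.strip().lower() for e in recommended}
    truth = {e.strip().lower() for e in ground_truth}
    hits = rec & truth
    # Precision: share of recommended exams that appear in the case record.
    precision = len(hits) / len(rec) if rec else 0.0
    # Recall: share of recorded exams that the model asked for.
    recall = len(hits) / len(truth) if truth else 0.0
    return precision, recall


p, r = exam_precision_recall(
    ["Abdominal CT", "CBC", "Chest X-ray", "Urinalysis"],
    ["abdominal ct", "cbc", "c-reactive protein"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of four recommendations match the record, and two of three recorded exams are requested, which shows how ordering many tests can lift recall while pulling precision down.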

Key Results

Benchmark	Metric (%)	Baseline	This Paper	Δ (pp)
Diagnostic accuracy results show DeepSeek-R1 leading across all settings, with substantially higher accuracy when full information (oracle) is provided.
MedR-Bench-Diagnosis (Oracle)	Accuracy	84.53	89.76	+5.23
MedR-Bench-Diagnosis (1-turn)	Accuracy	64.99	71.79	+6.80
Examination Recommendation results highlight a tradeoff between precision and recall, with models generally struggling to identify relevant tests accurately.
Examination Recommendation (1-turn)	Recall	43.12	43.61	+0.49
Examination Recommendation (1-turn)	Precision	32.48	41.78	+9.30
Reasoning quality metrics show that while models are largely factual, they differ widely in efficiency.
MedR-Bench-Diagnosis (Oracle)	Efficiency	71.20	97.17	+25.97
MedR-Bench-Diagnosis (Oracle)	Factuality	84.02	98.23	+14.21
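
The Δ column is the second score minus the first, in percentage points. A quick arithmetic check, using only values copied from the table above (the row labels are shortened for readability):

```python
# (label, first score, second score, reported delta), all from the table.
rows = [
    ("Diagnosis (Oracle) accuracy", 84.53, 89.76, 5.23),
    ("Diagnosis (1-turn) accuracy", 64.99, 71.79, 6.80),
    ("Exam recommendation (1-turn) recall", 43.12, 43.61, 0.49),
    ("Exam recommendation (1-turn) precision", 32.48, 41.78, 9.30),
    ("Diagnosis (Oracle) efficiency", 71.20, 97.17, 25.97),
    ("Diagnosis (Oracle) factuality", 84.02, 98.23, 14.21),
]

for label, first, second, reported in rows:
    # Round to two decimals to absorb floating-point noise.
    delta = round(second - first, 2)
    assert abs(delta - reported) < 1e-9, label
    print(f"{label}: {delta:+.2f} pp")
```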

Experiment Figures

Spider charts or bar plots comparing model performance across Examination, Diagnosis, and Treatment tasks.

Main Takeaways

Open-source models like DeepSeek-R1 are competitive with or superior to proprietary models (OpenAI-o3-mini) in clinical diagnostic accuracy (89.76% vs 84.53%).
Models perform well on diagnosis when information is complete (oracle; 84.53% for OpenAI-o3-mini and 89.76% for DeepSeek-R1) but struggle with information gathering (examination recommendation), where recall stays below 44%.
Treatment planning remains a difficult task, with precision scores for treatment plans (~30%) being much lower than diagnostic accuracy.
Rare disease performance is consistent with common diseases for diagnosis, suggesting robust knowledge, but treatment planning precision drops for rare conditions across most models.

📚 Prerequisite Knowledge

Prerequisites

Clinical diagnostic workflows (differential diagnosis)
LLM evaluation metrics (Precision, Recall, Accuracy)
Basics of Retrieval-Augmented Generation (for the evaluator)

Key Terms

Oracle setting: An evaluation setup where the model is provided with all ground-truth information (including all lab/imaging results) to test its upper-bound reasoning capability

Reasoning Evaluator: The proposed automated system that verifies free-text reasoning steps against medical knowledge bases to calculate efficiency, factuality, and completeness

PMC Open Access Subset: The portion of the PubMed Central (PMC) archive of biomedical and life sciences journal literature that is released under licenses permitting reuse; used as the source for the real-world case reports

1-turn vs. Free-turn: 1-turn restricts the model to a single round of querying for information; Free-turn allows iterative queries until the model decides it has sufficient info

Ancillary tests: Supplementary medical tests (labs, imaging) used to confirm or rule out diagnoses

DeepSeek-R1: An open-source reasoning-enhanced Large Language Model with 671 billion parameters