HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

📝 Paper Summary

Medical Agentic AI · Clinical Decision Support · Differential Diagnosis

HeartAgent is a multi-agent system that autonomously orchestrates specialized agents and cardiology-specific tools to generate verified differential diagnoses and explanations from multimodal clinical data.

Core Problem

General-purpose diagnostic AI often lacks deep cardiology knowledge, struggles with complex reasoning over overlapping symptoms, and operates as a 'black box' without verifiable evidence.

Why it matters:

Heart disease is a leading cause of death, and diagnosis requires precise differentiation between conditions with similar presentations (e.g., aortic dissection vs. myocardial infarction)
Trustworthy clinical AI requires transparency and adherence to guidelines, not just prediction accuracy
Existing models fail to integrate heterogeneous data (notes, ECG, imaging) with external medical knowledge effectively

Concrete Example: A patient with chest pain and diaphoresis might be misdiagnosed by a standard model that misses subtle distinctions. HeartAgent's reviewers might catch that 'estimated valve area 0.6 cm²' implies valvular stenosis (not just general heart disease) by cross-referencing specific guidelines.

Key Novelty

Collaborative Multi-Agent Cardiology Framework

Decomposes diagnosis into collaborative roles: a predictor, a generalist examiner (checking non-cardiac causes), and a specialist reviewer (refining cardiac hypotheses)
Integrates a 'reference verification' step that explicitly retrieves and validates supporting evidence from guidelines to ground explanations
Uses tools dynamically, combining visual analysis (ECG/Echo) with retrieval from curated cardiology knowledge bases

Evaluation Highlights

+36% improvement in top-3 diagnostic accuracy over Chain-of-Thought baselines on the MIMIC dataset
Clinicians assisted by HeartAgent surpassed unaided experts by 26.9% in diagnostic accuracy on random MIMIC cases
Achieved 92% precision in retrieving correct supporting references for diagnostic explanations

Breakthrough Assessment

8/10

Strong clinical utility demonstration with human-in-the-loop evaluation showing significant gains. The integration of self-verification and explicit reference retrieval addresses key trust issues in medical AI.

⚙️ Technical Details

Problem Definition

Setting: Multimodal differential diagnosis. Given patient notes, ECGs, and echocardiograms, output a ranked list of k probable diagnoses with verifiable rationales.

Inputs: Electronic health records (clinical notes), ECG waveforms/images, echocardiography data

Outputs: Top-k differential diagnoses, natural language explanations, and retrieved supporting references

Pipeline Flow

Tool Execution: Modality-specific analyzers process raw data (ECG, Echo, Notes)
Initial Diagnosis: Specialist Predictor generates hypotheses
Multi-Perspective Review: Generalist Examiner (non-cardiac) and Specialist Reviewer (cardiac) propose refinements
Self-Verification: Predictor integrates feedback and filters unlikely diagnoses
Evidence Grounding: Reference Agent retrieves and validates citations
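The five stages above can be read as a simple sequential orchestration. The following is a minimal sketch of that control flow only, not the authors' implementation: every name in it (`run_pipeline`, `CaseResult`, the agent callables) is hypothetical, and the stub agents return fixed toy outputs with no clinical logic.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical interfaces: each agent is a plain callable standing in for an LLM call.
Tool = Callable[[Any], str]                       # raw modality data -> textual report
Proposer = Callable[[str], List[str]]             # context -> candidate diagnoses
Verifier = Callable[[str, List[str]], List[str]]  # context, candidates -> kept diagnoses
Retriever = Callable[[str], List[str]]            # diagnosis -> supporting passages


@dataclass
class CaseResult:
    reports: Dict[str, str]
    differential: List[str]
    references: Dict[str, List[str]] = field(default_factory=dict)


def run_pipeline(raw_inputs: Dict[str, Any], tools: Dict[str, Tool],
                 predictor: Proposer, examiner: Proposer, reviewer: Proposer,
                 verifier: Verifier, retrieve: Retriever, k: int = 3) -> CaseResult:
    # 1. Tool execution: turn each modality into a textual report.
    reports = {name: tools[name](data) for name, data in raw_inputs.items()}
    context = "\n".join(f"[{name}] {text}" for name, text in sorted(reports.items()))
    # 2. Initial diagnosis from the specialist predictor.
    initial = predictor(context)
    # 3. Multi-perspective review: non-cardiac breadth and cardiac depth.
    non_cardiac = examiner(context)
    cardiac = reviewer(context + "\nInitial: " + ", ".join(initial))
    # 4. Self-verification: merge candidates (order-preserving, deduplicated), then filter.
    candidates = list(dict.fromkeys(initial + cardiac + non_cardiac))
    kept = verifier(context, candidates)[:k]
    # 5. Evidence grounding: attach retrieved references to each kept diagnosis.
    return CaseResult(reports, kept, {dx: retrieve(dx) for dx in kept})


# Toy stubs, for illustration only.
result = run_pipeline(
    raw_inputs={"notes": " chest pain, diaphoresis ", "echo": {"valve_area_cm2": 0.6}},
    tools={"notes": lambda text: text.strip(),
           "echo": lambda d: f"estimated valve area {d['valve_area_cm2']} cm2"},
    predictor=lambda ctx: ["myocardial infarction", "aortic dissection"],
    examiner=lambda ctx: ["pulmonary embolism"],
    reviewer=lambda ctx: ["aortic stenosis", "myocardial infarction"],
    verifier=lambda ctx, cands: [c for c in cands if c != "aortic dissection"],
    retrieve=lambda dx: [f"guideline passage about {dx}"],
)
print(result.differential)
```

Keeping each agent behind a plain callable makes it easy to swap backbones, which matches the summary's note that the reasoning agents share one configurable model.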

System Modules

Modality Tools

Convert raw multimodal data into structured textual reports

Model or implementation: MedGemma-4B (Vision) / NeuroKit2 (ECG Signal)

Specialist Predictor (Reasoning Core)

Generate initial diagnoses and perform final self-verification/synthesis

Model or implementation: Llama-3.3-70B / MedGemma-27B (configurable)

Generalist Examiner (Reasoning Core)

Identify potential non-cardiac causes for symptoms to prevent tunnel vision

Model or implementation: Same backbone as Predictor

Specialist Reviewer (Reasoning Core)

Critique initial cardiac predictions and suggest missing plausible cardiac conditions

Model or implementation: Same backbone as Predictor

Reference Verification Agent

Retrieve and verify supporting literature for generated explanations

Model or implementation: BM25 + MedCPT (Retrieval), LLM (Verification)
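A two-stage "retrieve, then re-rank" design like the one named above can be sketched as follows. This is an illustrative sketch under assumptions: the BM25 scoring is a plain-Python Okapi variant with a non-negative IDF, the function names are hypothetical, and the re-ranker is an arbitrary scoring callable. The demo uses Jaccard token overlap purely as a stand-in; it is not MedCPT.

```python
import math
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every document against the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def retrieve_then_rerank(query: str, docs: List[str],
                         rerank_score: Callable[[str, str], float],
                         top_k_first: int = 20, top_k_final: int = 5) -> List[str]:
    """Stage 1: BM25 shortlist. Stage 2: re-order the shortlist with a second scorer."""
    first = bm25_scores(query, docs)
    shortlist = sorted(range(len(docs)), key=lambda i: first[i], reverse=True)[:top_k_first]
    reranked = sorted(shortlist, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:top_k_final]]


def jaccard(query: str, doc: str) -> float:
    # Stand-in re-ranker for the demo only; a dense model would go here.
    a, b = set(tokenize(query)), set(tokenize(doc))
    return len(a & b) / len(a | b) if a | b else 0.0


docs = [
    "aortic stenosis is graded by valve area and mean gradient",
    "myocardial infarction presents with chest pain and troponin elevation",
    "pulmonary embolism may cause pleuritic chest pain and dyspnea",
]
query = "severe aortic stenosis valve area"
top = retrieve_then_rerank(query, docs, jaccard, top_k_first=2, top_k_final=1)
print(top)
```

The cheap lexical stage narrows the pool so that the more expensive second scorer only runs on a short candidate list.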

Novel Architectural Elements

Separation of 'Specialist Reviewer' (cardiac depth) and 'Generalist Examiner' (non-cardiac breadth) to emulate a multidisciplinary consult
Explicit post-generation 'Reference Verification' module that validates generated rationales against retrieved textbook chunks via claim extraction
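The claim-level check described above can be illustrated with a small sketch. It is a deliberately simplified stand-in: the summary states that verification is performed by an LLM, whereas this example uses lexical overlap between each extracted claim and the retrieved chunks. All function names, the stopword list, and the 0.5 threshold are assumptions made for illustration.

```python
import re
from typing import Dict, List, Set

# Small illustrative stopword list (assumption, not from the source).
STOPWORDS = {"the", "a", "an", "of", "is", "and", "with", "in", "to", "by"}


def extract_claims(rationale: str) -> List[str]:
    """Split a rationale into sentence-level claims (naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.;])\s+", rationale) if s.strip()]


def content_words(text: str) -> Set[str]:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def verify_claims(rationale: str, chunks: List[str], min_overlap: float = 0.5) -> Dict[str, bool]:
    """Mark a claim supported if enough of its content words appear in one chunk."""
    verdicts = {}
    for claim in extract_claims(rationale):
        words = content_words(claim)
        if not words:
            verdicts[claim] = False
            continue
        best = max((len(words & content_words(c)) / len(words) for c in chunks), default=0.0)
        verdicts[claim] = best >= min_overlap
    return verdicts


rationale = "Estimated valve area is 0.6 cm2. Severe aortic stenosis is likely."
chunks = ["Severe aortic stenosis is defined by hemodynamic criteria in the guideline."]
verdicts = verify_claims(rationale, chunks)
print(verdicts)
```

In this toy run the 'valve area' claim goes unsupported because the chunk speaks of 'stenosis', which is the kind of terminology mismatch a purely lexical check cannot bridge.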

Modeling

Base Model: Evaluated with Llama-3.3-70B, Qwen-2.5-32B, and MedGemma-27B

Training Method: Inference-time agent orchestration (Prompt Engineering + RAG)

Key Hyperparameters:

temperature_reasoning: 0.1
temperature_tools: 0.0
retrieval_chunk_size: 800 words
retrieval_stride: 50 words
retrieval_top_k_bm25: 20
retrieval_top_k_rerank: 5
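The retrieval settings above describe a sliding-window chunker over the knowledge base. The sketch below shows one way such a chunker could look. It rests on an assumption: the summary does not say whether the 50-word "stride" is the step between window starts or the overlap between consecutive windows, so the code takes the literal reading (step between starts). The names `RETRIEVAL_CONFIG` and `chunk_words` are hypothetical.

```python
from typing import List

# Values copied from the hyperparameter list; the dictionary itself is illustrative.
RETRIEVAL_CONFIG = {
    "chunk_size_words": 800,
    "stride_words": 50,
    "top_k_bm25": 20,
    "top_k_rerank": 5,
}


def chunk_words(text: str, size: int, stride: int) -> List[str]:
    """Split text into windows of `size` words, starting a new window every `stride` words."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Small numbers so the windows are easy to inspect.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, size=4, stride=2)
print(chunks)
```

Under the alternative reading (50 words of overlap), the step passed as `stride` would be the chunk size minus the overlap instead.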

Compute: Tools deployed with 4-bit quantization for GPU memory efficiency. Inference-only system.

Comparison to Prior Work

vs. Dual-Inf: HeartAgent adds explicit role-based agents (Generalist vs. Specialist) and external tool use, whereas Dual-Inf focuses on internal reasoning modes
vs. MDAgent: HeartAgent is specialized for cardiology, with domain-specific tools (ECG/Echo analysis) and reference verification, unlike the general-purpose MDAgent
vs. Standard CoT: HeartAgent enables dynamic information seeking (Web/Knowledge Base) rather than relying solely on parametric knowledge

Limitations

Evaluated only on adult patients; pediatric applicability is unexplored
Commercial LLMs (e.g., GPT-5) were evaluated only on open data (NEJM) due to privacy constraints
Reference verification sometimes fails (20% miss rate) when clinical explanations use non-canonical terminology (e.g., 'valve area' vs. 'stenosis')

📊 Experiments & Results

Evaluation Setup

Differential diagnosis on real-world clinical notes and multimodal data

Benchmarks:

MIMIC-IV (Retrospective Electronic Health Records diagnosis)
UMN Dataset (Private clinical cohort diagnosis) [New]
NEJM Case Challenge (Complex medical case diagnosis; open source)

Metrics:

Top-1 Diagnostic Accuracy
Top-3 Diagnostic Accuracy
Explanation Quality Score (LLM-as-a-judge vs. Expert Ground Truth)
Reference Verification Precision/Recall
Statistical methodology: Two-sided Mann-Whitney U test for significance testing
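The accuracy and retrieval metrics listed above follow standard definitions, which the sketch below spells out. It assumes one gold diagnosis per case and exact string matching between predicted and gold labels; the summary does not state the matching criterion the authors used, and the function names and toy labels are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   gold_labels: Sequence[str], k: int) -> float:
    """Fraction of cases whose gold diagnosis appears among the first k predictions."""
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_labels) if gold in preds[:k])
    return hits / len(gold_labels)


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a retrieved reference set against the relevant set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: four cases with ranked differentials and one gold label each.
predictions: List[List[str]] = [
    ["MI", "AS", "PE"],
    ["PE", "MI", "AS"],
    ["AS", "HF", "MI"],
    ["HF", "AF", "AS"],
]
gold = ["MI", "MI", "PE", "AS"]

top1 = top_k_accuracy(predictions, gold, k=1)
top3 = top_k_accuracy(predictions, gold, k=3)
precision, recall = precision_recall({"r1", "r2", "r3", "r4"}, {"r1", "r2", "r3", "r5", "r6"})
print(top1, top3, precision, recall)
```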

Key Results

HeartAgent consistently outperforms baseline methods across multiple LLM backbones on the MIMIC dataset.

Benchmark	Metric	Baseline	This Paper	Δ
MIMIC	Top-3 Accuracy	0.421	0.600	+0.179
MIMIC	Top-3 Accuracy	0.377	0.589	+0.212
UMN Dataset	Top-3 Accuracy	0.485	0.592	+0.107
NEJM	Top-3 Accuracy	0.411	0.556	+0.145

Human-AI collaboration experiments demonstrate that clinicians perform significantly better when assisted by HeartAgent.

Benchmark	Metric	Baseline	This Paper	Δ
MIMIC (Subsample)	Top-1 Accuracy	0.550	0.767	+0.217
MIMIC (Subsample)	Explanation Quality	0.554	0.781	+0.227

Ablation studies show that removing specific agents or knowledge sources significantly degrades performance.

Benchmark	Metric	Full System	Ablated	Δ
MIMIC	Top-3 Accuracy	0.600	0.530	-0.070
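The Δ column in the rows reported as fractions is the absolute difference between the two scores, which is a different quantity from a relative (percentage) gain. The short sketch below recomputes both from the tabulated values so the two can be told apart; the variable names are illustrative, and only numbers that appear in the table are used.

```python
from typing import List, Tuple

# (benchmark, metric, baseline, this_paper) copied from the rows reported as fractions.
rows: List[Tuple[str, str, float, float]] = [
    ("MIMIC", "Top-3 Accuracy", 0.421, 0.600),
    ("MIMIC", "Top-3 Accuracy", 0.377, 0.589),
    ("UMN Dataset", "Top-3 Accuracy", 0.485, 0.592),
    ("NEJM", "Top-3 Accuracy", 0.411, 0.556),
    ("MIMIC (Subsample)", "Top-1 Accuracy", 0.550, 0.767),
    ("MIMIC (Subsample)", "Explanation Quality", 0.554, 0.781),
]


def absolute_delta(baseline: float, system: float) -> float:
    """Difference in score, rounded to the table's three decimals."""
    return round(system - baseline, 3)


def relative_gain(baseline: float, system: float) -> float:
    """Improvement expressed as a fraction of the baseline score."""
    return (system - baseline) / baseline


deltas = [absolute_delta(base, ours) for _, _, base, ours in rows]
for (bench, metric, base, ours), delta in zip(rows, deltas):
    print(f"{bench:18s} {metric:20s} delta={delta:+.3f} relative={relative_gain(base, ours):+.1%}")
```

When comparing these rows with percentage figures quoted elsewhere, it helps to check which of the two quantities a given figure refers to.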

Main Takeaways

HeartAgent significantly enhances diagnostic accuracy (>36% on MIMIC) compared to standard prompting techniques by leveraging specialized agents and external knowledge.
The system is highly effective at supporting clinicians, raising human expert accuracy by 26.9% in collaborative settings.
Ablation studies confirm the necessity of both the 'Generalist Examiner' (for breadth) and 'Specialist Reviewer' (for depth), as removing either leads to significant performance drops.
The reference verification module reaches 92% precision on retrieved supporting references, though it struggles with semantic mismatch between clinical notes and textbook terminology.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of large language models (LLMs) and agentic workflows
Familiarity with RAG (Retrieval-Augmented Generation)
Medical terminology related to cardiology (differential diagnosis, ECG, comorbidities)

Key Terms

Differential Diagnosis: The process of distinguishing between two or more conditions that share similar signs or symptoms

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

RAG: Retrieval-Augmented Generation—enhancing model responses by fetching relevant external data (e.g., guidelines) during inference

BM25: A ranking function used in information retrieval to estimate the relevance of documents to a given search query

MedCPT: A pre-trained biomedical embedding model used for dense retrieval and re-ranking of medical texts

Agentic AI: Systems composed of autonomous modules (agents) that can plan, use tools, and collaborate to solve complex tasks

Self-verification: A process where an agent iteratively reviews its own outputs to correct errors or refine logic before final submission

Ablation study: Experiments where parts of the model are removed to assess their individual contribution to performance