VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

📝 Paper Summary

Factuality in Clinical NLP
Hallucination suppression

VeriFact is an automated system that checks the factual accuracy of clinical text against a patient's electronic health record, achieving agreement with human ground truth comparable to inter-clinician agreement.

Core Problem

Clinicians cannot efficiently verify LLM-generated clinical text due to the 'needle-in-a-haystack' challenge of checking facts against massive longitudinal Electronic Health Records (EHRs).

Why it matters:

LLMs are increasingly used to summarize patient records, but hallucinations in medical contexts can be dangerous
Manual verification by clinicians is too time-consuming to be scalable
Existing benchmarks focus on general medical knowledge (QA exams) rather than patient-specific grounded fact-checking against EHR data

Concrete Example: An LLM might write 'patient underwent a transthoracic echocardiogram which showed no orbital cellulitis.' A clinician must dig through cardiology and ophthalmology notes to find this is a hallucination (an echo cannot see orbital cellulitis). VeriFact automates this retrieval and verification.

Key Novelty

VeriFact: Patient-Specific EHR Fact-Checking via Atomic Proposition Decomposition

Decomposes clinical narratives into atomic propositions (simple Subject-Object-Predicate statements) based on logical atomism
Retrieves relevant facts from the patient's entire longitudinal EHR using RAG to create a dynamic reference context
Uses an LLM-as-a-Judge to verify if each proposition is 'Supported', 'Not Supported', or 'Not Addressed' by the retrieved context

Architecture

The VeriFact pipeline: Input Text -> Decomposition (Sentences/Atomic Claims) -> Retrieval (from EHR Vector DB) -> Verification (LLM-as-a-Judge) -> Verdict.

Evaluation Highlights

Achieves 92.7% agreement with human ground truth on sentence propositions for LLM-written summaries (comparable to human-human agreement)
Releases VeriFact-BHC, a dataset of 13,290 clinician-annotated statements from 100 MIMIC-III patients
Atomic claim extraction reduces invalid propositions to 0.4% compared to 19.8% for simple sentence splitting in human-written notes

Breakthrough Assessment

8/10

Significant contribution to clinical NLP by providing both a strong baseline system that matches human performance and a high-quality, expert-annotated benchmark dataset for a critical, under-explored problem (patient-specific factuality).

⚙️ Technical Details

Problem Definition

Setting: Verify a candidate text T against a patient's Electronic Health Record E

Inputs: Candidate clinical text T (e.g., discharge summary), Patient EHR documents E

Outputs: A set of verdicts V for each statement in T (Supported, Not Supported, Not Addressed)

Pipeline Flow

Proposition Extraction (Text -> Atomic Claims/Sentences)
EHR Fact Extraction (EHR Notes -> Atomic Claims/Sentences Database)
Retrieval (Proposition -> Relevant EHR Facts)
Verification (Proposition + Retrieved Facts -> Verdict)
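
The four steps above can be sketched as a single function that wires together three pluggable components. This is a minimal illustration, not the authors' implementation: the names `verify_text`, `Result`, `toy_extract`, `toy_retrieve`, and `toy_judge` are hypothetical, and the toy components use string splitting and word overlap where the paper's system uses LLM calls and a vector database.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    SUPPORTED = "Supported"
    NOT_SUPPORTED = "Not Supported"
    NOT_ADDRESSED = "Not Addressed"


@dataclass
class Result:
    proposition: str
    evidence: List[str]
    verdict: Verdict


def verify_text(
    text: str,
    ehr_notes: List[str],
    extract: Callable[[str], List[str]],
    retrieve: Callable[[str, List[str]], List[str]],
    judge: Callable[[str, List[str]], Verdict],
) -> List[Result]:
    # EHR fact extraction: the same extractor is applied to every EHR note.
    ehr_facts = [fact for note in ehr_notes for fact in extract(note)]
    results = []
    # Proposition extraction on the candidate text.
    for prop in extract(text):
        evidence = retrieve(prop, ehr_facts)   # Retrieval
        verdict = judge(prop, evidence)        # Verification
        results.append(Result(prop, evidence, verdict))
    return results


# Toy stand-ins (assumptions for illustration only).
def toy_extract(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_retrieve(prop: str, facts: List[str], k: int = 2) -> List[str]:
    words = set(prop.lower().split())

    def overlap(fact: str) -> int:
        return len(words & set(fact.lower().split()))

    ranked = sorted(facts, key=overlap, reverse=True)
    return [f for f in ranked[:k] if overlap(f) > 0]


def toy_judge(prop: str, evidence: List[str]) -> Verdict:
    if not evidence:
        return Verdict.NOT_ADDRESSED
    if prop.lower() in [e.lower() for e in evidence]:
        return Verdict.SUPPORTED
    return Verdict.NOT_SUPPORTED
```

Passing the extractor, retriever, and judge as arguments keeps the control flow identical whether the components are toys or real model calls.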

System Modules

Proposition Extractor

Decompose input text and EHR notes into verifiable units (sentences or atomic claims)

Model or implementation: GPT-4-Turbo (gpt-4-0125-preview)
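
The two proposition types can be illustrated with a short sketch. It assumes a naive regex sentence splitter for sentence propositions, plus a hypothetical prompt template and output parser for LLM-based atomic claim extraction. The prompt wording and the function names are assumptions and do not come from the paper.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows ., ! or ?.
    # Abbreviations such as "Dr. Smith" will be split incorrectly.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def atomic_claim_prompt(text: str) -> str:
    # Hypothetical prompt; the paper's actual prompt is not reproduced here.
    return (
        "Decompose the following clinical text into atomic claims. "
        "Each claim must assert a single fact and be understandable on its own.\n\n"
        f"Text: {text}\n\n"
        "Atomic claims (one per line):"
    )


def parse_claims(llm_output: str) -> List[str]:
    # Strip list markers such as "- " or "1. " and drop empty lines.
    claims = []
    for line in llm_output.splitlines():
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if cleaned:
            claims.append(cleaned)
    return claims
```

The splitter's weakness on abbreviations and fragments hints at why sentence-level units can be fragile on shorthand-heavy notes.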

Retriever

Find facts in the EHR that support or refute the candidate proposition

Model or implementation: Hybrid (BM25 + OpenAI text-embedding-3-small) with BGE-Reranker-v2-m3
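
One common way to combine a sparse (BM25) ranking with a dense (embedding) ranking is reciprocal rank fusion (RRF). The summary does not state which fusion method VeriFact uses, so the sketch below is a swapped-in technique for illustration. The function name and the constant `k = 60` (a conventional default) are assumptions.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    # Each ranking is a list of fact IDs ordered best-first.
    # A fact's fused score is the sum of 1 / (k + rank) over all rankings.
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, fact_id in enumerate(ranking, start=1):
            scores[fact_id] = scores.get(fact_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for determinism.
    return sorted(scores, key=lambda fact_id: (-scores[fact_id], fact_id))


sparse_ranking = ["a", "b", "c"]   # e.g. from BM25
dense_ranking = ["b", "c", "d"]    # e.g. from embedding similarity
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
# fused == ["b", "c", "a", "d"]
```

In a pipeline like the one described, the fused candidates would then be passed to the reranker, which selects the final facts shown to the evaluator.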

Evaluator

Judge whether the proposition is supported by the retrieved context

Model or implementation: GPT-4-Turbo (gpt-4-0125-preview)
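
A sketch of the evaluator's input and output handling is shown below. It assumes a hypothetical prompt template and a simple string-matching parser; neither is taken from the paper. The model call itself is omitted, so only the prompt construction and verdict parsing are shown.

```python
from typing import List

# Multi-word labels are checked first because "Supported" is a
# substring of "Not Supported".
LABELS = ("Not Supported", "Not Addressed", "Supported")


def build_judge_prompt(proposition: str, facts: List[str]) -> str:
    # Hypothetical template: numbered reference facts, then the statement.
    context = "\n".join(f"{i}. {fact}" for i, fact in enumerate(facts, start=1))
    return (
        "You are verifying a statement against facts from a patient's EHR.\n"
        f"Reference facts:\n{context}\n\n"
        f"Statement: {proposition}\n\n"
        "Answer with exactly one label: Supported, Not Supported, or Not Addressed."
    )


def parse_verdict(llm_output: str) -> str:
    text = llm_output.lower()
    for label in LABELS:
        if label.lower() in text:
            return label
    raise ValueError(f"No verdict label found in: {llm_output!r}")
```

Raising on unparseable output, rather than defaulting to a label, avoids silently inflating any one verdict class.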

Novel Architectural Elements

Application of logical atomism to patient-specific EHR fact-checking (decomposing both the query text and the reference EHR into atomic facts)
Symmetric extraction pipeline where candidate text and reference corpus undergo identical decomposition

Modeling

Base Model: GPT-4-Turbo (gpt-4-0125-preview)

Comparison to Prior Work

vs. General Domain Fact-Checking: VeriFact verifies against a private, patient-specific corpus (EHR) rather than open knowledge (Wikipedia/Web)
vs. Existing Clinical Metrics: Moves beyond n-gram overlap to semantic verification of logical propositions

Limitations

Hardware constraints limited retrieval to 50 facts; performance did not saturate, suggesting more retrieval could help
The evaluator is biased toward 'Not Supported' over 'Not Addressed' relative to human annotators, especially in information-asymmetric scenarios
Atomic claim extraction can sometimes fail on complex compound nouns (e.g., attributing orbital cellulitis to an echo)
Relies on proprietary models (GPT-4), which may have privacy implications for clinical data

Reproducibility

Code: https://github.com/som-shahlab/VeriFact

📊 Experiments & Results

Evaluation Setup

Verify statements from Discharge Summaries against MIMIC-III EHR notes

Benchmarks:

VeriFact-BHC (Fact Verification) [New]

Metrics:

Percent Agreement with Human Ground Truth
Sensitivity
Positive Predictive Value (PPV)
Statistical methodology: Not explicitly reported in the paper
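
The three metrics can be computed from paired label lists as in the sketch below. Sensitivity and PPV use their standard per-label definitions, TP / (TP + FN) and TP / (TP + FP). The function names are illustrative and are not taken from the VeriFact codebase.

```python
from typing import List, Tuple


def percent_agreement(pred: List[str], truth: List[str]) -> float:
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and the same length")
    matches = sum(p == t for p, t in zip(pred, truth))
    return 100.0 * matches / len(truth)


def sensitivity_ppv(pred: List[str], truth: List[str], label: str) -> Tuple[float, float]:
    # Treat `label` as the positive class and everything else as negative.
    tp = sum(p == label and t == label for p, t in zip(pred, truth))
    fn = sum(p != label and t == label for p, t in zip(pred, truth))
    fp = sum(p == label and t != label for p, t in zip(pred, truth))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv
```

Reporting sensitivity and PPV per label shows which verdict classes the evaluator confuses, which a single agreement figure hides.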

Key Results

VeriFact shows high agreement with human ground truth, particularly on LLM-generated summaries.
Benchmark	Metric	Baseline	This Paper	Δ
VeriFact-BHC (LLM-written)	Agreement with Ground Truth (Sentence propositions)	84.7	92.7	+8.0
VeriFact-BHC (LLM-written)	Agreement with Ground Truth (Atomic propositions)	88.5	88.8	+0.3
VeriFact-BHC (Human-written)	Agreement with Ground Truth (Sentence propositions)	66.6	66.0	-0.6

Experiment Figures

Impact of the number of retrieved facts on VeriFact's label assignment

Main Takeaways

VeriFact reaches agreement with human ground truth comparable to inter-clinician agreement (88.8-92.7%) when evaluating LLM-generated summaries against patient records
Atomic claim decomposition is essential for handling messy human-written clinical notes (0.4% invalid vs 19.8% for sentences)
Retrieving more facts (up to the 50-fact limit) and using hybrid retrieval with reranking improve performance
LLM-as-a-Judge struggles to distinguish between 'Not Supported' (contradiction) and 'Not Addressed' (missing info) compared to humans, but performs well when these are merged
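
The effect of merging the two negative labels can be sketched as follows. The label lists are made-up illustrative data, not results from the paper, and the merged label name is an assumption.

```python
from typing import List

MERGED_NEGATIVE = "Not Supported or Not Addressed"  # hypothetical merged label


def merge_labels(labels: List[str]) -> List[str]:
    # Collapse the three-way scheme to binary: Supported vs. everything else.
    return [lab if lab == "Supported" else MERGED_NEGATIVE for lab in labels]


def agreement(a: List[str], b: List[str]) -> float:
    # Percent of positions where the two label lists match.
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Illustrative labels only.
human = ["Supported", "Not Addressed", "Not Supported", "Not Addressed"]
model = ["Supported", "Not Supported", "Not Supported", "Not Supported"]

three_way = agreement(model, human)                            # 50.0
binary = agreement(merge_labels(model), merge_labels(human))   # 100.0
```

In this toy case every disagreement is a 'Not Supported' versus 'Not Addressed' confusion, so merging removes all of them. Real data would also contain disagreements involving 'Supported', which merging does not resolve.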

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
LLM-as-a-Judge
Electronic Health Records (EHR) structure
Propositional logic (atomic claims)

Key Terms

Atomic Claim: A declarative statement asserting a single fact, often in Subject-Object-Predicate format, used as the unit of verification

MIMIC-III: A widely used publicly available dataset of de-identified health data associated with intensive care unit admissions

BHC: Brief Hospital Course—a narrative summary section within a discharge summary describing the patient's stay

RAG: Retrieval-Augmented Generation—fetching relevant context from a database to aid an LLM's generation or evaluation

LLM-as-a-Judge: Using an LLM to evaluate the quality or correctness of text outputs, often replacing human annotation

Ground Truth: In this paper, a consensus label derived from majority voting and adjudication among physician annotators

Hybrid Retrieval: Combining keyword-based (sparse) and semantic (dense) search methods to improve retrieval relevance