Assessing Automated Fact-Checking for Medical LLM Responses with Knowledge Graphs

📝 Paper Summary

Factuality evaluation · Medical Knowledge Graphs (KGs)

FAITH is an unsupervised, reference-free framework that evaluates the factuality of medical LLM responses by decomposing them into atomic claims and verifying them against paths in a medical knowledge graph.

Core Problem

Deploying LLMs in healthcare requires rigorous factuality verification, but traditional metrics like BLEU require reference answers (often unavailable) and correlate poorly with clinician judgments of factual accuracy.

Why it matters:

LLMs frequently produce plausible but dangerously inaccurate medical information (hallucinations), undermining trust and patient safety
Clinical studies are too slow to keep pace with LLM development, necessitating automated evaluation methods
Existing model-based evaluators (LLM-as-a-judge) are themselves prone to hallucinations and lack explainability regarding why a claim is rejected

Concrete Example: An LLM might claim 'dry cough is a symptom of bronchiectasis'. Traditional metrics check lexical overlap with a reference, ignoring the specific medical fact. FAITH extracts this claim, maps 'dry cough' and 'bronchiectasis' to KG nodes, and verifies if a valid evidence path exists between them.

Key Novelty

Knowledge Graph-Grounded Reference-Free Evaluation

Decomposes text into atomic claims and maps entities to a standard medical ontology (UMLS) rather than relying on text overlap or another LLM's opinion
Scores factuality by finding the shortest path between entities in the KG and assessing the semantic congruence of the path's relations to the claim's predicate
Incorporates entity centrality (PageRank) and relationship co-occurrence patterns to penalize generic or weak associations

Architecture

The complete FAITH pipeline, illustrating the process from an LLM response to a final factuality score using a Knowledge Graph.

Evaluation Highlights

Achieves Pearson correlation of 0.696 with human clinician judgments, significantly outperforming BLEU-4 (0.081) and GPT-4o-based metrics
Robust to paraphrasing with a coefficient of variation of 0.014 ± 0.005, compared to 0.910 ± 0.862 for BLEU-4
Pinpoints erroneous statements identified by clinicians with a precision of 0.65 and recall of 0.59

Breakthrough Assessment

8/10

Significant improvement in correlation with human experts for medical fact-checking without needing reference answers. The approach offers high explainability, though it relies heavily on the completeness of the underlying Knowledge Graph.

⚙️ Technical Details

Problem Definition

Setting: Given a generated response D and a Knowledge Graph G, determine a factuality score for the atomic claims extracted from D

Inputs: LLM-generated medical response text

Outputs: Aggregated factuality score in the range [-1, 1] and per-claim verification paths

Pipeline Flow

Group 1: Extraction: LLM Response -> [Medical Claim Extraction] -> [Medical Entity Matching]
Group 2: Verification: [KG Traversal] -> [Factual Evaluation (Scoring)] -> Final Score

System Modules

Medical Claim Extraction (Extraction)

Decompose response into atomic triplets (subject, predicate, object) using GPT-4o with multi-phase prompting

Model or implementation: GPT-4o (with 5-shot in-context learning)
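As a rough illustration of what the extraction step hands to the rest of the pipeline, the sketch below parses a list of (subject, predicate, object) triplets. It assumes the extraction model is asked to return a JSON array with `subject`, `predicate`, and `object` keys; that output format, the `Claim` class, and `parse_claims` are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One atomic claim as a (subject, predicate, object) triplet."""
    subject: str
    predicate: str
    obj: str


def parse_claims(raw_json: str) -> list[Claim]:
    """Parse a JSON array of triplet records into Claim objects.

    Assumption: the extraction model returns records with the keys
    'subject', 'predicate', and 'object'. Incomplete records are skipped
    rather than repaired, so no field is ever guessed.
    """
    claims = []
    for record in json.loads(raw_json):
        if not all(record.get(key) for key in ("subject", "predicate", "object")):
            continue
        claims.append(
            Claim(
                subject=record["subject"].strip().lower(),
                predicate=record["predicate"].strip().lower(),
                obj=record["object"].strip().lower(),
            )
        )
    return claims


# Hypothetical model output: one complete triplet and one with an empty predicate.
example_output = (
    '[{"subject": "Dry cough", "predicate": "is a symptom of", "object": "Bronchiectasis"},'
    ' {"subject": "bronchiectasis", "predicate": "", "object": "antibiotics"}]'
)
claims = parse_claims(example_output)
```

Only the first record survives parsing; the second is dropped because its predicate is empty.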

Medical Entity Matching (Extraction)

Map text entities to standardized Knowledge Graph nodes (CUIs)

Model or implementation: UMLS API (Entity Resolution)
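The matching step, including the conservative handling of unmatched entities, can be sketched with an in-memory lookup. This is a toy stand-in for the UMLS entity resolution: `TOY_LEXICON`, `match_entity`, and `link_claim` are hypothetical names, and the identifiers are placeholders rather than real CUIs.

```python
from typing import Optional

# Toy lexicon mapping surface forms to concept identifiers.
# The identifiers are placeholders, NOT real UMLS CUIs.
TOY_LEXICON = {
    "dry cough": "CUI_DRY_COUGH",
    "nonproductive cough": "CUI_DRY_COUGH",  # synonyms share one concept
    "bronchiectasis": "CUI_BRONCHIECTASIS",
}


def match_entity(mention: str) -> Optional[str]:
    """Return the concept identifier for a mention, or None if unmatched."""
    key = " ".join(mention.lower().split())  # normalize case and whitespace
    return TOY_LEXICON.get(key)


def link_claim(subject: str, obj: str) -> dict:
    """Link both ends of a claim; never force a match for an unknown entity."""
    subject_id = match_entity(subject)
    object_id = match_entity(obj)
    status = "matched" if subject_id and object_id else "unverifiable"
    return {"status": status, "subject_id": subject_id, "object_id": object_id}


linked = link_claim("Nonproductive  cough", "bronchiectasis")
unlinked = link_claim("dry cough", "made-up syndrome")
```

A claim with any unmatched entity is labeled `unverifiable` and can be excluded from scoring downstream instead of being matched to the nearest-looking concept.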

KG Traversal (Verification)

Find the shortest evidence path between the subject and object nodes in the KG

Model or implementation: Shortest path algorithm on UMLS KG
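A shortest evidence path on an unweighted graph can be found with breadth-first search. The sketch below runs on a three-edge toy graph; the edges, the relation names, and the choice to traverse edges in both directions are assumptions made for illustration and do not reflect the actual UMLS graph or the paper's traversal settings.

```python
from collections import deque

# Toy (head, relation, tail) edges. Illustrative only, not UMLS content.
TOY_EDGES = [
    ("dry cough", "is_a", "cough"),
    ("cough", "manifestation_of", "bronchiectasis"),
    ("bronchiectasis", "is_a", "lung disease"),
]


def build_adjacency(edges):
    """Index edges by node; inverse edges make the graph traversable both ways."""
    adjacency = {}
    for head, relation, tail in edges:
        adjacency.setdefault(head, []).append((relation, tail))
        adjacency.setdefault(tail, []).append((relation + "^-1", head))
    return adjacency


def shortest_path(adjacency, source, target):
    """Return the shortest path as (node, relation, node) hops, or None."""
    if source not in adjacency or target not in adjacency:
        return None
    if source == target:
        return []
    parent = {source: None}  # node -> (previous node, relation used)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for relation, neighbor in adjacency[node]:
            if neighbor in parent:
                continue
            parent[neighbor] = (node, relation)
            if neighbor == target:
                # Walk the parent pointers back to the source.
                hops, current = [], target
                while parent[current] is not None:
                    previous, used_relation = parent[current]
                    hops.append((previous, used_relation, current))
                    current = previous
                return hops[::-1]
            queue.append(neighbor)
    return None


adjacency = build_adjacency(TOY_EDGES)
path = shortest_path(adjacency, "dry cough", "bronchiectasis")
```

The returned hops double as the per-claim evidence path, which is what makes a rejected or accepted claim inspectable.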

Factual Evaluation (Scoring) (Verification)

Compute factuality score based on path length, semantic similarity, entity centrality, and relation co-occurrence

Model or implementation: Mathematical scoring function (Eq. 2 in paper)
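The paper's Eq. 2 is not reproduced in this summary, so the following is only an illustrative way to combine three of the listed signals into a score in [-1, 1]. The functional form, the `decay` and `centrality_weight` parameters, and both function names are assumptions, not the authors' formula; relation co-occurrence is left out entirely.

```python
def claim_score(congruence, path_length, centrality, decay=0.5, centrality_weight=0.5):
    """Illustrative per-claim score in [-1, 1]. NOT the paper's Eq. 2.

    congruence: similarity between the claim predicate and the path relations,
        assumed to be already computed and to lie in [-1, 1].
    path_length: number of hops in the evidence path (>= 1).
    centrality: normalized centrality of the entities in [0, 1]; highly central,
        generic nodes are penalized.
    """
    if path_length < 1:
        raise ValueError("path_length must be at least 1")
    if not 0.0 <= centrality <= 1.0:
        raise ValueError("centrality must be in [0, 1]")
    length_factor = decay ** (path_length - 1)        # longer paths count less
    specificity = 1.0 - centrality_weight * centrality  # generic nodes count less
    return congruence * length_factor * specificity


def response_score(claim_scores):
    """Average the verifiable claims; None marks an unverifiable claim."""
    verifiable = [score for score in claim_scores if score is not None]
    if not verifiable:
        return None
    return sum(verifiable) / len(verifiable)


direct = claim_score(congruence=1.0, path_length=1, centrality=0.0)
two_hop = claim_score(congruence=1.0, path_length=2, centrality=0.0)
contradicted = claim_score(congruence=-1.0, path_length=1, centrality=1.0)
overall = response_score([direct, None, contradicted])
```

With these toy inputs a direct, fully congruent path scores 1.0, the same relation over two hops scores 0.5, and the unverifiable claim is skipped when averaging.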

Novel Architectural Elements

Unsupervised scoring function combining path structure, semantic similarity of relations, and graph centrality metrics (PageRank)
Conservative filtering: explicitly labeling unmatched entities as 'unverifiable' rather than hallucinating a match
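For readers unfamiliar with the centrality signal, here is a minimal power-iteration PageRank on a toy directed graph. The damping factor of 0.85 and the fixed iteration count are conventional defaults rather than settings reported for FAITH, and the graph is invented for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of successor nodes."""
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    updated[other] += share
        rank = updated
    return rank


# Toy graph: three specific findings all point at one generic hub node.
toy_graph = {
    "dry cough": ["symptom"],
    "fever": ["symptom"],
    "fatigue": ["symptom"],
    "symptom": [],
}
ranks = pagerank(toy_graph)
```

The generic hub `symptom` ends up with the highest rank, which is the kind of node a centrality penalty would treat as weak evidence.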

Modeling

Base Model: GPT-4o (used for claim extraction)

Compute: Not reported in the paper

Comparison to Prior Work

vs. FActScore: Uses structured KG paths instead of unstructured retrieval; does not require a reference corpus
vs. MEDCON/imapScore: Reference-free; evaluates validity of the claim itself rather than similarity to a gold standard
vs. G-Eval: Grounded in external verified medical ontology (UMLS) rather than LLM parametric knowledge; provides interpretable evidence paths
vs. SelfCheckGPT [not cited in paper]: FAITH uses external ground truth (KG) rather than checking consistency across the model's own sampled outputs

Limitations

Reliability depends heavily on the coverage and quality of the Knowledge Graph (UMLS); unmatched entities are ignored
Shortest path assumption may not always reflect the most medically relevant connection
Computational cost of graph traversal for every claim in a response
Requires access to proprietary GPT-4o for the claim extraction step

Reproducibility

Code: https://zenodo.org/records/17603819

The code is listed as available on Zenodo (https://zenodo.org/records/17603819). The method uses GPT-4o (closed source) for extraction and the UMLS 2025AA database (requires a license). Prompt strategies are described in Appendix B.1.

📊 Experiments & Results

Evaluation Setup

Medical Question Answering (QA), Summarization, and Fact Verification

Benchmarks:

MedQA (Multiple-choice QA)
MMLU (Medical subset) (Multiple-choice QA)
MS-AKT (Medical QA)
LiveQA (Open-ended QA)
FactPICO (Medical Summarization)

Metrics:

Pearson correlation (rho) with clinician judgments
Coefficient of Variation (CV) for robustness
Accuracy (when used for filtering)
Precision/Recall/F1 for error localization
Statistical methodology: Paired t-test for distinguishing LLM capabilities; Cohen’s kappa for inter-rater agreement
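The two headline metrics are simple to compute. The sketch below uses only the standard library; whether the paper uses the sample or the population standard deviation for the coefficient of variation is not stated here, so the sample version is an assumption.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(scores) / abs(mean)


# Toy values: metric scores against clinician ratings, and one metric's
# scores across paraphrases of the same response.
r_aligned = pearson([1, 2, 3], [2, 4, 6])
r_opposed = pearson([1, 2, 3], [3, 2, 1])
cv_stable = coefficient_of_variation([0.5, 0.5, 0.5])
cv_spread = coefficient_of_variation([1, 2, 3])
```

A metric that is robust to paraphrasing gives nearly identical scores across rewordings, so its CV approaches zero.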

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Correlation with human expert judgments shows FAITH significantly outperforms reference-based and LLM-based metrics.
MS-AKT (Clinician subset)	Pearson Correlation	0.081 (BLEU-4)	0.696	+0.615
MS-AKT (Clinician subset)	Pearson Correlation	0.380	0.696	+0.316
MS-AKT (Clinician subset)	Pearson Correlation	0.550	0.696	+0.146
Robustness analysis demonstrates FAITH is far less sensitive to phrasing variations than lexical metrics.
Paraphrased Responses	Coefficient of Variation (CV)	0.910 (BLEU-4)	0.014	-0.896
Application results show FAITH can be used to filter bad answers (Reject-to-Answer) or trigger RAG.
MedQA	Answer Accuracy	69.5	86.3	+16.8
FactPICO	Pearson Correlation	0.33	0.61	+0.28
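The Reject-to-Answer use in the table above amounts to thresholding the factuality score and measuring accuracy on the answers that remain. A minimal sketch follows, with a hypothetical `apply_rta` helper, an arbitrary threshold, and made-up records; none of these values come from the paper.

```python
def apply_rta(records, threshold=0.0):
    """Keep answers whose score meets the threshold; report accuracy on them.

    records: dicts with 'score' (float, or None if unverifiable) and
        'correct' (bool). Answers without a score are rejected as well.
    Returns (kept_records, accuracy_on_kept), with accuracy None if nothing is kept.
    """
    kept = [r for r in records if r["score"] is not None and r["score"] >= threshold]
    if not kept:
        return kept, None
    accuracy = sum(1 for r in kept if r["correct"]) / len(kept)
    return kept, accuracy


# Made-up records for illustration.
records = [
    {"score": 0.8, "correct": True},
    {"score": -0.4, "correct": False},
    {"score": 0.3, "correct": True},
    {"score": 0.1, "correct": False},
    {"score": None, "correct": True},
]
baseline_accuracy = sum(1 for r in records if r["correct"]) / len(records)
kept, filtered_accuracy = apply_rta(records, threshold=0.2)
```

The trade-off is coverage: raising the threshold tends to raise accuracy on answered questions while leaving more questions unanswered, so both numbers should be reported together.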

Experiment Figures

Scatter plots correlating automated metric scores with human clinician ratings.

Analysis of FAITH's interpretability and common LLM error types.

Main Takeaways

FAITH aligns much better with clinician judgments (rho=0.696) than traditional NLP metrics or LLM-based evaluators.
The method is highly robust to surface-level textual variations (paraphrasing), focusing on the underlying medical facts.
It effectively distinguishes between LLMs of varying medical capabilities (e.g., GPT-4o vs Llama 3), which some metrics like ROUGE failed to do.
The framework provides granular explainability, correctly identifying the specific claims clinicians flagged as erroneous in 83.6% of cases (top-5 ranking).
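Error localization can be scored as overlap between the claims a metric flags and the claims clinicians flagged. The sketch below shows set-based precision, recall, and F1, plus one plausible reading of a top-k check in which the k lowest-scoring claims are treated as the most suspect. How the paper defines its top-5 ranking is not specified in this summary, so `hit_at_k` is an assumption.

```python
def precision_recall_f1(predicted, gold):
    """Set-overlap precision, recall, and F1 for flagged claim IDs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def hit_at_k(scored_claims, gold, k=5):
    """True if any of the k lowest-scoring claims was flagged by clinicians."""
    most_suspect = sorted(scored_claims, key=lambda pair: pair[1])[:k]
    return any(claim_id in gold for claim_id, _ in most_suspect)


# Made-up claim IDs and scores for illustration.
precision, recall, f1 = precision_recall_f1({"c1", "c2", "c3"}, {"c2", "c3", "c4", "c5"})
hit = hit_at_k([("c1", 0.9), ("c2", -0.7), ("c3", 0.2)], gold={"c2"}, k=1)
```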

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (entities, relations, paths)
Basic NLP (Named Entity Recognition, Relation Extraction)
Information Retrieval metrics (Precision, Recall)

Key Terms

UMLS: Unified Medical Language System—a comprehensive ontology linking synonymous medical terms from over 200 biomedical vocabulary sources

Atomic claim: A single factual statement decomposed into a triplet (subject, predicate, object)

Knowledge path: A sequence of alternating entities and relations connecting two nodes in a graph

CUI: Concept Unique Identifier—a code used in UMLS to represent a specific medical concept regardless of the exact term used

PageRank: An algorithm used here to measure the centrality or importance of an entity node within the knowledge graph

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents/facts

RTA: Reject-to-Answer—a safety mechanism where the model refuses to answer a query if its confidence or factuality score is below a threshold

BLEU: Bilingual Evaluation Understudy—a metric measuring text overlap between generated output and a reference answer

BERTScore: A metric evaluating semantic similarity between candidate and reference text using contextual embeddings