MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification

📝 Paper Summary

Medical Factuality Evaluation · Decompose-then-Verify Frameworks

MedScore is a domain-adapted factuality evaluation pipeline for medical text that improves claim decomposition by preserving context and subjectivity, addressing failures of general-purpose biography-trained evaluators.

Core Problem

Existing decompose-then-verify factuality systems were developed on objective, formulaic biographies (e.g., Wikipedia) and fail on medical answers, which are subjective, conditional, and structurally complex.

Why it matters:

Incorrect medical information from LLMs can cause serious patient harm, making reliable factuality evaluation critical
Current systems like FActScore over-decompose medical text into invalid claims (17% validity) or fail to capture empathy and advice
Medical answers contain complex 'molecular' facts (conditionals, suggestions) rather than just 'atomic' facts (dates, entities), requiring specialized handling

Concrete Example: A medical response might say 'If the pain gets worse, you should call a doctor.' FActScore might split this into context-free atomic claims that lose the conditional 'If...', while MedScore preserves the dependency as a single verifiable unit.

Key Novelty

MedScore (Medical-adapted Decompose-then-Verify)

Introduces a new taxonomy for medical claim decomposition that handles imperatives, conditionals, and subjective 'bedside manner' differently from objective facts
Uses a context-aware decomposition prompt that extracts 'molecular' facts (retaining necessary modifiers) rather than stripping sentences down to 'atomic' entities
Verifies claims against three distinct sources: internal parametric knowledge, original doctor responses (gold standard), and retrieved medical literature

Architecture

[Figure: The MedScore decompose-then-verify pipeline for a single sentence.]

Evaluation Highlights

MedScore extracts up to 3x more valid facts than existing methods like FActScore
Achieves significantly higher valid claim rate (95%) compared to FActScore (17%) and Core (67%) on the AskDocsAI dataset
Reduces the '0-claim rate' (responses where no facts are found) to near zero, whereas VeriScore fails to find claims in ~15% of responses

Breakthrough Assessment

7/10

Strong practical contribution for the high-stakes medical domain. Successfully adapts the decompose-then-verify paradigm to handle subjective/conditional text, though the core architectural innovation is primarily in prompting strategies and taxonomy application rather than new model architectures.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of free-form medical Question Answering (QA) generations

Inputs: A generated medical response R (and optionally the patient question Q)

Outputs: A scalar factuality score in [0.0, 1.0] representing the proportion of supported claims

Pipeline Flow

Input Response
Decomposition (LLM splits text into claims based on the MedScore taxonomy)
Verification (Claims checked against Knowledge Source)
Scoring (Fraction of claims verified as supported)
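
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `medscore_style_score`, the toy decomposer and verifier, and the choice to score a response with no claims as 0.0 are all assumptions made for the example. In the real pipeline both callables would be LLM calls.

```python
from typing import Callable, List


def medscore_style_score(
    response: str,
    decompose: Callable[[str], List[str]],
    verify: Callable[[str], bool],
) -> float:
    """Decompose a response into claims, verify each, return the supported fraction."""
    claims = decompose(response)
    if not claims:
        # Assumption: a response with no extracted claims scores 0.0.
        return 0.0
    supported = sum(1 for claim in claims if verify(claim))
    return supported / len(claims)


# Hypothetical stand-ins for the LLM decomposer and verifier.
def toy_decompose(text: str) -> List[str]:
    # Splits on periods only, so a conditional sentence stays in one piece.
    return [part.strip() for part in text.split(".") if part.strip()]


def toy_verify(claim: str) -> bool:
    # Rejects one known-false statement; everything else counts as supported.
    return "antibiotics cure viral" not in claim.lower()


score = medscore_style_score(
    "If the pain gets worse, you should call a doctor. "
    "Antibiotics cure viral infections.",
    toy_decompose,
    toy_verify,
)
print(score)
```

Passing the decomposer and verifier as callables keeps the scoring step independent of which model or knowledge source backs each stage.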

System Modules

Decomposer

Split medical answers into self-contained, valid claims

Model or implementation: GPT-4o-mini

Retriever (Verification)

Fetch relevant medical evidence (if using External Corpus verification)

Model or implementation: MedCPT (Retriever) + MedRAG (System)
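
To make the retrieval step concrete, here is a minimal sketch of top-k dense retrieval over precomputed embeddings. It does not use the MedCPT or MedRAG APIs. The function names, the toy two-dimensional vectors, and the use of cosine similarity as the scoring function are assumptions for illustration, and the actual retriever may score passages differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_passages(
    query_vec: Sequence[float],
    passage_vecs: List[Sequence[float]],
    passages: List[str],
    k: int = 10,
) -> List[Tuple[float, str]]:
    """Return the k passages most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), text) for vec, text in zip(passage_vecs, passages)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical); a real system would embed text with a trained encoder.
hits = top_k_passages(
    query_vec=[1.0, 0.0],
    passage_vecs=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    passages=["passage a", "passage b", "passage c"],
    k=2,
)
print([text for _, text in hits])
```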

Verifier (Verification)

Judge if the claim is supported by the evidence or internal knowledge

Model or implementation: Mistral-Small-24B-Instruct-2501 (Mistral Small 3)
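
The verifier step amounts to building a judgment prompt and parsing the model's answer. The sketch below shows one way this could look. The prompt wording, the 'supported' / 'not supported' answer format, and both function names are hypothetical and are not taken from the paper. The call to the verifier model itself is left out.

```python
from typing import List, Optional


def build_verification_prompt(claim: str, evidence: Optional[List[str]] = None) -> str:
    """Build a judge prompt; with no evidence, fall back to the model's own knowledge."""
    lines: List[str] = []
    if evidence:
        lines.append("Evidence:")
        lines.extend(f"- {passage}" for passage in evidence)
    else:
        lines.append("Use your own medical knowledge.")
    lines.append(f"Claim: {claim}")
    lines.append("Is the claim supported? Answer 'supported' or 'not supported'.")
    return "\n".join(lines)


def parse_verdict(model_output: str) -> bool:
    """Map the model's free-text answer to a boolean; unknown answers count as unsupported."""
    text = model_output.strip().lower()
    # Check the negative form first because it contains the word "supported".
    if text.startswith("not supported"):
        return False
    return text.startswith("supported")


prompt = build_verification_prompt(
    "If the pain gets worse, the patient should call a doctor.",
    evidence=["Worsening pain warrants medical review."],
)
print(prompt)
print(parse_verdict("Supported."))
```

Treating unparseable answers as unsupported is a conservative choice for this sketch. A production system might retry or log them instead.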

Novel Architectural Elements

Taxonomy-driven decomposition module specifically engineered to output 'molecular' facts (preserving conditionals/imperatives) rather than standard 'atomic' facts
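
Since the decomposition module is driven by prompting with in-context examples, its core can be pictured as a few-shot prompt builder. The following is a sketch under stated assumptions: the instruction text, the example pair, and the name `build_decomposition_prompt` are invented for illustration and do not reproduce the paper's prompt or taxonomy.

```python
from typing import List, Tuple

# One few-shot example: (source sentence, claims extracted from it).
FewShot = Tuple[str, List[str]]


def build_decomposition_prompt(
    instructions: str,
    examples: List[FewShot],
    sentence: str,
    max_shots: int = 10,
) -> str:
    """Assemble instructions, up to max_shots worked examples, and the target sentence."""
    parts = [instructions.strip()]
    for source, claims in examples[:max_shots]:
        parts.append("Sentence: " + source)
        parts.append("Claims:")
        parts.extend("- " + claim for claim in claims)
    parts.append("Sentence: " + sentence)
    parts.append("Claims:")
    return "\n".join(parts)


# Hypothetical instruction and example; the conditional is kept as one claim.
demo_examples: List[FewShot] = [
    (
        "If the pain gets worse, you should call a doctor.",
        ["If the pain gets worse, the patient should call a doctor."],
    )
]
demo_prompt = build_decomposition_prompt(
    "Extract self-contained claims. Keep conditions and advice together with their context.",
    demo_examples,
    "Take ibuprofen with food to reduce stomach irritation.",
)
print(demo_prompt)
```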

Modeling

Base Model: GPT-4o-mini (Decomposition), Mistral-Small-24B-Instruct-2501 (Verification)

Training Method: Prompt engineering and In-context Learning (10-shot)

Adaptation: None (Inference-only pipeline)

Trainable Parameters: None

Key Hyperparameters:

retrieval_top_k: 10
decomposition_shots: 10

Compute: Not reported in the paper

Comparison to Prior Work

vs. FActScore: MedScore uses taxonomy-guided prompting to retain conditionals/imperatives, whereas FActScore atomizes them, losing meaning (17% valid vs 95% valid)
vs. VeriScore: MedScore extracts about 2.7x more claims per response (15.06 vs. 5.62) and almost never returns zero claims, whereas VeriScore is overly conservative (14.67% 0-claim rate)
vs. Core: MedScore filters via generation constraints, whereas Core's post-filtering indiscriminately removes both valid and invalid claims

Limitations

Reliance on proprietary models (GPT-4o-mini) for the decomposition step
Retrieval corpus (MedCorp) quality directly bottlenecks verification accuracy
Manual annotation for evaluation is resource-intensive and subjective
Verification using LLMs as judges can still suffer from their own hallucinations or reasoning failures

Reproducibility

Code: https://github.com/Heyuan9/MedScore

The code is publicly available. The repository contains the code and the AskDocsAI dataset; the CaLMQA dataset subset is also provided. Verification relies on closed-source (GPT-4o) and open-weights (Mistral Small 3) models.

📊 Experiments & Results

Evaluation Setup

Factuality evaluation of medical chatbot responses

Benchmarks:

AskDocsAI: Medical QA, Reddit-based [New]
PUMA: Medical QA, Yahoo! Answers
CaLMQA: Cultural/General QA, Reddit ELI5

Metrics:

Factuality Score (0.0-1.0)
Valid Claim Rate (%)
0-claim Rate (%)
Average number of claims per response
Statistical methodology: Cohen's kappa (0.73) reported for inter-annotator agreement on claim quality.
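
The claim-level metrics listed above are simple ratios. The sketch below shows how they could be computed from decomposition output and human validity labels. The function names and the toy data are assumptions for illustration, and the numbers are not from the paper.

```python
from typing import List


def zero_claim_rate(claims_per_response: List[List[str]]) -> float:
    """Percentage of responses for which no claim was extracted."""
    if not claims_per_response:
        raise ValueError("need at least one response")
    empty = sum(1 for claims in claims_per_response if not claims)
    return 100.0 * empty / len(claims_per_response)


def average_claims(claims_per_response: List[List[str]]) -> float:
    """Mean number of extracted claims per response."""
    if not claims_per_response:
        raise ValueError("need at least one response")
    total = sum(len(claims) for claims in claims_per_response)
    return total / len(claims_per_response)


def valid_claim_rate(is_valid: List[bool]) -> float:
    """Percentage of annotated claims that human raters judged valid."""
    if not is_valid:
        raise ValueError("need at least one annotated claim")
    return 100.0 * sum(is_valid) / len(is_valid)


# Toy data (hypothetical): four responses, one of which yielded no claims.
toy_claims = [["a", "b"], [], ["c"], ["d", "e", "f"]]
print(zero_claim_rate(toy_claims))
print(average_claims(toy_claims))
print(valid_claim_rate([True, True, False, True]))
```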

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Valid Claim Rate analysis shows MedScore generates significantly higher quality claims than baselines on AskDocsAI.
AskDocsAI	Valid Claim Rate (%)	17.0 (FActScore)	95.0	+78.0
AskDocsAI	Valid Claim Rate (%)	67.0 (Core)	95.0	+28.0
Claim coverage analysis (Claims per response) shows MedScore captures more information than the conservative VeriScore.
AskDocsAI	Claims per Response	5.62 (VeriScore)	15.06	+9.44
AskDocsAI	0-claim Rate (%)	14.67 (VeriScore)	0.33	-14.34
Factuality Score results depend heavily on the verification backend. Using Internal Knowledge (Mistral Small 3) shows variance.
AskDocsAI	Factuality Score (%, Internal-Mistral)	39.6	66.5	+26.9
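
The Δ column is the MedScore value minus the baseline value. As a quick arithmetic check, the snippet below recomputes each delta from the figures in the table, along with the ratio of claims per response. The variable names are illustrative only.

```python
# (metric, baseline value, MedScore value) copied from the table above.
rows = [
    ("Valid Claim Rate", 17.0, 95.0),
    ("Valid Claim Rate", 67.0, 95.0),
    ("Claims per Response", 5.62, 15.06),
    ("0-claim Rate (%)", 14.67, 0.33),
    ("Factuality Score (Internal-Mistral)", 39.6, 66.5),
]

# Round to two decimals to avoid floating-point noise in the differences.
deltas = [round(medscore - baseline, 2) for _, baseline, medscore in rows]

# How many times more claims per response MedScore extracts than the baseline.
claims_ratio = round(15.06 / 5.62, 2)

for (metric, _, _), delta in zip(rows, deltas):
    print(f"{metric}: {delta:+.2f}")
print(f"claims ratio: {claims_ratio}")
```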

Main Takeaways

FActScore dramatically underperforms on medical text because it atomizes complex advice into invalid fragments (only 17% valid claims).
VeriScore is high-precision but low-recall (high 0-claim rate), missing critical medical information by being over-conservative.
The 'molecular' decomposition approach of MedScore retains necessary context (conditionals, imperatives), resulting in higher human-rated validity (95%).
Factuality scores are highly sensitive to the verifier model and corpus; FActScore consistently underestimates factuality due to poor decomposition.

📚 Prerequisite Knowledge

Prerequisites

Understanding of the 'decompose-then-verify' framework (e.g., FActScore)
Familiarity with RAG (Retrieval-Augmented Generation)
Basic knowledge of LLM evaluation metrics (LLM-as-a-Judge)

Key Terms

Decompose-then-verify: An evaluation strategy where long text is broken into individual claims ('atoms') which are then independently checked for truthfulness

Atomic facts: Simple, indivisible statements (e.g., 'Obama was born in Hawaii') used in standard evaluation

Molecular facts: Complex facts retaining conditionals, modifiers, or dependency structures (e.g., 'If X happens, do Y') essential for medical advice

FActScore: A popular framework for evaluating factuality by breaking text into atomic claims, originally designed for biographies

AskDocsAI: A new dataset introduced in this paper containing medical Q&A pairs from Reddit's r/AskDocs with LLM-augmented doctor answers

MedCorp: A medical retrieval corpus compiled in this paper consisting of PubMed, StatPearls, and Textbooks

0-claim rate: The percentage of generated responses for which the evaluation system fails to extract a single verifiable claim

Internal Knowledge: Verification using the model's own pre-trained parameters (parametric knowledge) without external retrieval

MedRAG: A specific retrieval-augmented generation toolkit used here to fetch evidence from medical corpora

MedCPT: A medical-specific dense retriever model used to rank passages from MedCorp