
MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification

Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze
Center for Language and Speech Processing, Johns Hopkins University
arXiv (2025)
Factuality Benchmark RAG QA

📝 Paper Summary

Medical Factuality Evaluation Decompose-then-Verify Frameworks
MedScore is a domain-adapted factuality evaluation pipeline for medical text that improves claim decomposition by preserving context and subjectivity, addressing failures of general-purpose biography-trained evaluators.
Core Problem
Existing decompose-then-verify factuality systems are trained on objective, formulaic biographies (e.g., Wikipedia) and fail on medical answers, which are subjective, conditional, and structurally complex.
Why it matters:
  • Incorrect medical information from LLMs can cause serious patient harm, making reliable factuality evaluation critical
  • Current systems such as FActScore over-decompose medical text into invalid claims (17% valid claim rate) or fail to capture empathy and advice
  • Medical answers contain complex 'molecular' facts (conditionals, suggestions) rather than just 'atomic' facts (dates, entities), requiring specialized handling
Concrete Example: A medical response might say 'If the pain gets worse, you should call a doctor.' FActScore might split this into context-free atomic claims that lose the conditional 'If...', whereas MedScore preserves the dependency as a single verifiable unit.
Key Novelty
MedScore (Medical-adapted Decompose-then-Verify)
  • Introduces a new taxonomy for medical claim decomposition that handles imperatives, conditionals, and subjective 'bedside manner' differently from objective facts
  • Uses a context-aware decomposition prompt that extracts 'molecular' facts (retaining necessary modifiers) rather than stripping sentences down to 'atomic' entities
  • Verifies claims against three distinct sources: internal parametric knowledge, original doctor responses (gold standard), and retrieved medical literature
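The decompose-then-verify flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `score_response`, `toy_decompose`, and `make_toy_verifier` are hypothetical names, the decomposer and verifier are trivial stand-ins for the LLM prompts, and scoring a response as the fraction of supported claims is an assumed, FActScore-style aggregation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Claim:
    text: str
    supported: Optional[bool] = None  # filled in by the verification step


def score_response(
    response: str,
    decompose: Callable[[str], List[str]],
    verify: Callable[[str], bool],
) -> Optional[float]:
    """Return the fraction of supported claims, or None for a 0-claim response."""
    claims = [Claim(text) for text in decompose(response)]
    if not claims:
        return None  # no claims extracted, so no score can be computed
    for claim in claims:
        claim.supported = verify(claim.text)
    return sum(claim.supported for claim in claims) / len(claims)


def toy_decompose(response: str) -> List[str]:
    """Stand-in decomposer (assumption): one claim per sentence.

    Keeping the whole sentence preserves conditionals such as 'If ...',
    which is the 'molecular' behaviour the summary describes.
    """
    return [s.strip() for s in response.split(".") if s.strip()]


def make_toy_verifier(reference: str) -> Callable[[str], bool]:
    """Stand-in verifier (assumption): substring match against a reference text.

    The reference could play the role of a doctor response or retrieved
    literature; a real verifier would use an LLM or entailment model.
    """
    reference_lower = reference.lower()
    return lambda claim: claim.lower() in reference_lower


reference = "If the pain gets worse, you should call a doctor."
response = (
    "If the pain gets worse, you should call a doctor. "
    "Ibuprofen cures fractures."
)
score = score_response(response, toy_decompose, make_toy_verifier(reference))
print(score)
```

Passing the decomposer and verifier as callables keeps the scoring logic independent of the verification source, so the same loop could be run once per source (parametric knowledge, doctor responses, retrieved literature).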
Evaluation Highlights
  • MedScore extracts up to 3x more valid facts than existing methods like FActScore
  • Achieves a significantly higher valid claim rate (95%) than FActScore (17%) and Core (67%) on the AskDocsAI dataset
  • Reduces the '0-claim rate' (responses where no facts are found) to near zero, whereas VeriScore fails to find claims in ~15% of responses
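The two decomposition metrics mentioned above can be computed as follows. This is a hedged sketch: the function names and the toy labels are illustrative assumptions, and the valid claim rate is taken here to mean the share of all extracted claims judged valid, which may differ in detail from the paper's exact definition.

```python
from typing import List


def valid_claim_rate(labels_per_response: List[List[bool]]) -> float:
    """Share of extracted claims judged valid, pooled over all responses."""
    total = sum(len(labels) for labels in labels_per_response)
    if total == 0:
        return 0.0
    valid = sum(sum(labels) for labels in labels_per_response)
    return valid / total


def zero_claim_rate(labels_per_response: List[List[bool]]) -> float:
    """Share of responses for which the decomposer extracted no claims."""
    if not labels_per_response:
        return 0.0
    empty = sum(1 for labels in labels_per_response if not labels)
    return empty / len(labels_per_response)


# Toy validity labels (made up for illustration): one inner list per response,
# one boolean per extracted claim.
toy_labels = [
    [True, True, False],
    [],                      # a 0-claim response
    [True],
    [True, True, True, True],
]
print(valid_claim_rate(toy_labels))  # 7 valid out of 8 claims
print(zero_claim_rate(toy_labels))   # 1 empty out of 4 responses
```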
Breakthrough Assessment
7/10
Strong practical contribution for the high-stakes medical domain. Successfully adapts the decompose-then-verify paradigm to handle subjective and conditional text, though the innovation lies primarily in prompting strategies and taxonomy application rather than in new model architectures.