
Towards Effective Extraction and Evaluation of Factual Claims

Dasha Metropolitansky, Jonathan Larson
Microsoft Research
Annual Meeting of the Association for Computational Linguistics (2025)
Factuality Benchmark QA

📝 Paper Summary

Factuality Evaluation, Claim Extraction, Hallucination suppression
A new framework for evaluating claim extraction based on entailment, element-level coverage, and outcome-based decontextualization, accompanied by Claimify, a method that explicitly handles ambiguity.
Core Problem
Fact-checking long-form LLM content requires extracting simple claims, but existing extraction methods lack a standardized evaluation framework and often misrepresent text by ignoring ambiguity or omitting context.
Why it matters:
  • Inaccurate or incomplete claim extraction compromises downstream fact-checking, leading to misleading or false verdicts
  • Current evaluation methods rely on subjective human judgments or simplistic metrics like atomicity, which don't correlate with verification performance
  • Existing extractors force-resolve ambiguous sentences or strip necessary context, creating 'hallucinated' claims that the original text did not support
Concrete Example: Sentence: 'John Smith supports government regulations.' A standard extractor might output this as a standalone claim. However, if the full text says 'In Jane Doe's podcast on EVs, Smith supports regulations,' the missing context might lead a fact-checker to retrieve irrelevant evidence about Smith's views on healthcare, resulting in a false verdict.
Key Novelty
Claimify: Ambiguity-Aware Claim Extraction & Outcome-Based Evaluation
  • Introduces 'Element-Level Coverage': Evaluates whether each verifiable piece of information (element) in a sentence is captured by the extracted claims, while penalizing the inclusion of unverifiable content, unlike binary sentence-level metrics
  • Proposes 'Outcome-Based Decontextualization': Instead of asking humans if a claim makes sense alone, it checks if adding context changes the automated fact-checking verdict
  • Claimify Method: A pipeline that explicitly identifies referential and structural ambiguity, refusing to extract claims if the correct interpretation cannot be resolved from context
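The two evaluation ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Element` structure, the scoring rule in `element_level_accuracy` (verifiable elements should be covered, unverifiable ones should not), and the helper `context_changes_verdict` are hypothetical names and simplifications, and `toy_fact_check` is a stand-in for a real retrieval-backed fact-checker.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Element:
    """One piece of information from a source sentence (hypothetical structure)."""
    text: str
    verifiable: bool  # annotation: does the element state something checkable?
    covered: bool     # annotation: do the extracted claims express it?


def element_level_accuracy(elements: Sequence[Element]) -> float:
    """Share of elements handled correctly, assuming (for this sketch) that
    verifiable elements should be covered and unverifiable ones should not."""
    if not elements:
        raise ValueError("need at least one element")
    correct = sum(e.covered == e.verifiable for e in elements)
    return correct / len(elements)


def context_changes_verdict(
    claim: str,
    claim_with_context: str,
    fact_check: Callable[[str], str],
) -> bool:
    """True if adding the omitted context changes the fact-checker's verdict."""
    return fact_check(claim) != fact_check(claim_with_context)


def toy_fact_check(claim: str) -> str:
    # Stand-in for an automated fact-checker; a real one would retrieve evidence.
    return "supported" if "EVs" in claim else "not enough evidence"


elements = [
    Element("Smith supports regulations", verifiable=True, covered=True),
    Element("the topic is EVs", verifiable=True, covered=False),
    Element("regulations are a good idea", verifiable=False, covered=True),
    Element("opinion about the podcast", verifiable=False, covered=False),
]
accuracy = element_level_accuracy(elements)

changed = context_changes_verdict(
    "John Smith supports government regulations.",
    "On Jane Doe's podcast on EVs, John Smith supports government regulations on EVs.",
    toy_fact_check,
)
```

In this toy setup, a changed verdict signals that the standalone claim dropped context the fact-checker needed, which is the kind of failure the outcome-based test is meant to surface.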
Evaluation Highlights
  • Claimify achieved 99.0% claim entailment, a statistical tie with the best baseline (VeriScore) and significantly higher than DnD (89.1%)
  • On element-level coverage, Claimify reached 87.9% accuracy, surpassing the next best method (DnD at 76.9%) by 11.0 percentage points
  • In decontextualization tests using Google Search, Claimify produced desirable outcomes in 80.6% of cases, significantly higher than all baselines (next best: DnD at 78.4%)
Breakthrough Assessment
8/10
Strong contribution to evaluation methodology (outcome-based decontextualization is clever) and a solid new method (Claimify) that addresses the overlooked problem of ambiguity.