Towards Effective Extraction and Evaluation of Factual Claims

📝 Paper Summary

Factuality Evaluation, Claim Extraction, Hallucination Suppression

A new framework for evaluating claim extraction based on entailment, element-level coverage, and outcome-based decontextualization, accompanied by Claimify, a method that explicitly handles ambiguity.

Core Problem

Fact-checking long-form LLM content requires extracting simple claims, but existing extraction methods lack a standardized evaluation framework and often misrepresent text by ignoring ambiguity or omitting context.

Why it matters:

Inaccurate or incomplete claim extraction compromises downstream fact-checking, leading to misleading or false verdicts
Current evaluation methods rely on subjective human judgments or simplistic metrics like atomicity, which don't correlate with verification performance
Existing extractors force-resolve ambiguous sentences or strip necessary context, creating 'hallucinated' claims that the original text did not support

Concrete Example: Sentence: 'John Smith supports government regulations.' A standard extractor might output this as a standalone claim. However, if the full text says 'In Jane Doe's podcast on EVs, Smith supports regulations,' the missing context might lead a fact-checker to retrieve irrelevant evidence about Smith's views on healthcare, resulting in a false verdict.

Key Novelty

Claimify: Ambiguity-Aware Claim Extraction & Outcome-Based Evaluation

Introduces 'Element-Level Coverage': Evaluates if verifiable information bits are captured while penalizing the inclusion of unverifiable content, unlike binary sentence-level metrics
Proposes 'Outcome-Based Decontextualization': Instead of asking humans if a claim makes sense alone, it checks if adding context changes the automated fact-checking verdict
Claimify Method: A pipeline that explicitly identifies referential and structural ambiguity, refusing to extract claims if the correct interpretation cannot be resolved from context
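
The outcome-based check described above can be sketched as a comparison of two verdicts: one for the extracted claim and one for a version of the claim with full context added. This is a minimal sketch under stated assumptions. The `verify` callable, the verdict labels, and the rule "same verdict means desirable" are illustrative simplifications, not the paper's exact procedure.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical verdict function: maps a claim to a label such as
# "supported", "refuted", or "inconclusive". A real one would wrap
# evidence retrieval plus a judge model; here it is any callable.
VerifyFn = Callable[[str], str]


def desirable_rate(pairs: Iterable[Tuple[str, str]], verify: VerifyFn) -> float:
    """Share of (claim, max_context_claim) pairs whose verdicts agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one claim pair")
    same = sum(verify(claim) == verify(full) for claim, full in pairs)
    return same / len(pairs)


def toy_verify(claim: str) -> str:
    # Toy assumption: only claims that name the topic find usable evidence.
    return "supported" if "EV" in claim else "inconclusive"


FULL = "On Jane Doe's podcast about EVs, John Smith supports government regulations on EVs."
pairs = [
    ("John Smith supports government regulations.", FULL),         # context stripped
    ("John Smith supports government regulations on EVs.", FULL),  # context kept
]
rate = desirable_rate(pairs, toy_verify)
```

With the toy verifier, only the second claim yields the same verdict as its full-context version, so half of the pairs count as desirable.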

Evaluation Highlights

Claimify achieved 99.0% claim entailment, statistically tying with the best baseline (VeriScore) while significantly outperforming DnD (89.1%)
On element-level coverage, Claimify reached 87.9% accuracy, surpassing the next best method (DnD at 76.9%) by 11.0 percentage points
In decontextualization tests using Google Search, Claimify produced desirable outcomes in 80.6% of cases, significantly higher than all baselines (next best: DnD at 78.4%)

Breakthrough Assessment

8/10

Strong contribution to evaluation methodology (outcome-based decontextualization is clever) and a solid new method (Claimify) that addresses the overlooked problem of ambiguity.

⚙️ Technical Details

Problem Definition

Setting: Given a question-answer pair, extract a set of decontextualized, verifiable factual claims such that they fully cover the source's factual content without hallucination.

Inputs: Question-Answer pair (long-form text)

Outputs: List of atomic, decontextualized factual claims

Pipeline Flow

Sentence Splitting & Context Creation
Selection (Verifiability Check)
Disambiguation (Ambiguity Resolution)
Decomposition (Claim Extraction)

System Modules

Sentence Splitter

Split answer into sentences and attach context (preceding/following sentences)

Model or implementation: NLTK sentence tokenizer
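
Attaching context to each sentence can be sketched as a sliding window over the split sentences. The regex splitter below is a naive stand-in for the NLTK tokenizer named above (`nltk.tokenize.sent_tokenize`), used only to keep the example dependency-free. The window sizes `before` and `after` are hypothetical defaults, not values from the paper.

```python
import re
from typing import List, NamedTuple


class SentenceWithContext(NamedTuple):
    sentence: str
    context: str


def split_sentences(text: str) -> List[str]:
    # Naive stand-in for a real tokenizer: split after ., ! or ?
    # when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def attach_context(sentences: List[str], before: int = 1, after: int = 1) -> List[SentenceWithContext]:
    """Pair each sentence with its neighbours (window sizes are assumptions)."""
    out = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - before)
        hi = min(len(sentences), i + after + 1)
        out.append(SentenceWithContext(sent, " ".join(sentences[lo:hi])))
    return out


text = "Smith spoke on Doe's podcast. He supports regulations. Critics disagree."
items = attach_context(split_sentences(text))
```

Here the second sentence ("He supports regulations.") carries the preceding sentence in its context, which is what a later stage would need to resolve "He".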

Selector

Determine if a sentence contains verifiable content; rewrite to remove unverifiable parts if mixed

Model or implementation: LLM (e.g., GPT-4o)

Disambiguator

Identify referential/structural ambiguity; resolve using context if possible, or flag as unresolvable

Model or implementation: LLM (e.g., GPT-4o)

Decomposer

Break disambiguated sentence into atomic, decontextualized claims

Model or implementation: LLM (e.g., GPT-4o)
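
The control flow across the modules above can be sketched as follows. This is an illustrative skeleton, not the authors' implementation: the stage signatures, the use of `None` to signal "nothing to extract", and the toy rule-based stages are all assumptions standing in for LLM calls.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical stage signatures; each takes (sentence, context).
Selector = Callable[[str, str], Optional[str]]       # verifiable rewrite, or None
Disambiguator = Callable[[str, str], Optional[str]]  # resolved sentence, or None
Decomposer = Callable[[str, str], List[str]]         # standalone claims


def extract_claims(items: Iterable[Tuple[str, str]], select: Selector,
                   disambiguate: Disambiguator, decompose: Decomposer) -> List[str]:
    claims: List[str] = []
    for sentence, context in items:
        verifiable = select(sentence, context)
        if verifiable is None:
            continue  # no verifiable content in this sentence
        resolved = disambiguate(verifiable, context)
        if resolved is None:
            continue  # unresolvable ambiguity: extract nothing rather than guess
        claims.extend(decompose(resolved, context))
    return claims


# Toy stages (assumptions) so the sketch runs without a model.
def toy_select(sentence: str, context: str) -> Optional[str]:
    return None if sentence.startswith("I think") else sentence


def toy_disambiguate(sentence: str, context: str) -> Optional[str]:
    if "They" not in sentence:
        return sentence
    if "Smith and Doe" in context:
        return sentence.replace("They", "Smith and Doe")
    return None


def toy_decompose(sentence: str, context: str) -> List[str]:
    return [sentence]


items = [
    ("I think this is great.", "I think this is great."),
    ("They support EV regulations.", "Smith and Doe spoke. They support EV regulations."),
    ("They oppose it.", "They oppose it."),
]
claims = extract_claims(items, toy_select, toy_disambiguate, toy_decompose)
```

Only the second sentence survives: the first is dropped as unverifiable, and the third is dropped because its context does not say who "They" are.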

Novel Architectural Elements

Explicit Disambiguation Stage: A dedicated module that halts extraction if ambiguity cannot be confidently resolved, rather than guessing
Bracketed Context Notation: Uses brackets to explicitly mark information inferred from context vs. stated in text
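
One way to work with bracketed context is to separate what the sentence states from what was inferred. The sketch below assumes a simple format in which inferred spans are wrapped in square brackets inside the claim text. The exact notation used by Claimify may differ, and both helper names are hypothetical.

```python
import re
from typing import List, Tuple

BRACKET = re.compile(r"\[([^\[\]]*)\]")


def split_bracketed(claim: str) -> Tuple[str, List[str]]:
    """Return (text stated in the source, list of context-inferred spans)."""
    inferred = BRACKET.findall(claim)
    stated = BRACKET.sub("", claim)
    stated = re.sub(r"\s+", " ", stated).strip()
    stated = re.sub(r"\s+([.,;:!?])", r"\1", stated)  # drop space left before punctuation
    return stated, inferred


def unwrap(claim: str) -> str:
    """Drop the brackets but keep their content, giving the standalone claim."""
    return BRACKET.sub(r"\1", claim)


claim = "John Smith supports government regulations [on EVs]."
stated, inferred = split_bracketed(claim)
full = unwrap(claim)
```

Keeping the two parts separable lets a reviewer see which portion of a claim came from the sentence itself and which was supplied from surrounding context.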

Modeling

Base Model: Evaluated using gpt-4o-2024-08-06, mistral-large-2411, and DeepSeek-V3

Compute: Not reported in the paper (Inference-only method)

Comparison to Prior Work

vs. VeriScore/DnD/SAFE: Claimify includes an explicit disambiguation step to handle unresolvable ambiguity
vs. Molecular Facts [not cited in paper]: Claimify resolves ambiguity using only local context/question, whereas Molecular Facts uses the model's parametric knowledge (risk of hallucination)
vs. AFaCTA/Factcheck-GPT: Claimify performs full extraction, not just classification

Limitations

Evaluated on a single dataset (BingCheck), though it covers diverse topics
Does not handle temporal ambiguity where time is simply missing (e.g., 'unemployment decreased' without a date)
Relies on LLM performance; evaluated primarily with GPT-4o
Hyperparameters (context window size) were not exhaustively tuned

Reproducibility

Prompt templates for all stages (Selection, Disambiguation, Decomposition) are provided in Appendix N.1. The BingCheck dataset and annotation guidelines are described. Code URL is not explicitly provided.

📊 Experiments & Results

Evaluation Setup

Comparison of extracted claims from 396 BingCheck answers generated by Microsoft Copilot.

Benchmarks:

BingCheck (Long-form Question Answering)

Metrics:

Entailment %
Sentence-Level Coverage (Accuracy/F1)
Element-Level Coverage (Accuracy/F1)
Decontextualization (Percentage of 'Desirable' outcomes)

Statistical methodology: Two-proportion Z-tests with Holm-Bonferroni correction
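
For readers unfamiliar with this procedure, the following is a minimal standard-library sketch of a two-sided two-proportion z-test (pooled standard error) followed by Holm's step-down correction. The counts in the example are invented for illustration and are not taken from the paper.

```python
import math
from typing import List


def two_proportion_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for H0: p1 == p2, using the pooled standard error."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both samples are all-success or all-failure
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Per hypothesis, whether it is rejected under Holm's step-down rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Invented counts: (successes, total) for a method versus two baselines.
p_values = [
    two_proportion_p(450, 500, 400, 500),  # 90% vs 80%
    two_proportion_p(255, 500, 250, 500),  # 51% vs 50%
]
decisions = holm_bonferroni(p_values)
```

Holm's rule compares the smallest p-value against alpha / m, the next against alpha / (m - 1), and so on, which controls the family-wise error rate while being less conservative than a plain Bonferroni correction.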

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Entailment results measure if the extracted claims are supported by the source text.
BingCheck	Entailment %	89.1	99.0	+9.9
Coverage results measure how well verifiable content is captured (Recall) and unverifiable content excluded (Precision).
BingCheck	Sentence-Level Accuracy	81.6	91.8	+10.2
BingCheck	Element-Level Macro F1	56.2	83.7	+27.5
Decontextualization results measure if the claim is self-contained enough to produce the same verification verdict as the fully contextualized version.
BingCheck	% Desirable Results (Google Search)	78.4	80.6	+2.2
BingCheck	% Desirable Results (Bing Search)	79.3	80.5	+1.2
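
The coverage rows in the table report accuracy and macro F1. As a reminder of how these are computed for a binary judgment, here is a small sketch. It assumes each element has a gold label (should it be covered?) and a predicted label (was it covered by the extracted claims?); this pairing is an illustrative simplification of the paper's annotation scheme, and the labels below are invented.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[bool], pred: Sequence[bool], positive: bool) -> float:
    """F1 treating `positive` as the target class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Unweighted mean of the F1 scores for the True and False classes."""
    return (f1_for_class(gold, pred, True) + f1_for_class(gold, pred, False)) / 2


def accuracy(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Invented labels for five elements.
gold = [True, True, True, False, False]
pred = [True, True, False, False, True]
score = macro_f1(gold, pred)
acc = accuracy(gold, pred)
```

Macro averaging weights both classes equally, so a method that covers verifiable content well but also lets unverifiable content through is penalized even when one class dominates the data.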

Main Takeaways

Claimify consistently achieves the best balance of high entailment and high coverage, avoiding the trade-off seen in baselines (e.g., VeriScore has high entailment but poor coverage).
The 'Selection' stage is critical; removing it drops element-level coverage F1 from 83.7% to 54.4%, showing the importance of pre-filtering unverifiable content.
Current NLI models are insufficient for evaluating claim entailment; a custom LLM prompt aligned much better with human judgment.
Outcome-based decontextualization evaluation reveals that seemingly 'decontextualized' claims often fail to retrieve the correct evidence unless rigorously checked against a max-context version.

📚 Prerequisite Knowledge

Prerequisites

Fact-checking pipelines (Decompose-then-verify)
Natural Language Inference (NLI)
Precision/Recall/F1 metrics

Key Terms

Decontextualization: Rewriting a sentence so it stands alone (e.g., replacing pronouns with entities) while retaining its original meaning

Entailment: The logical requirement that if the source text is true, the extracted claim must also be true

Element-level coverage: A granular metric checking if specific verifiable 'elements' (facts) within a sentence are present in the extracted claims

Referential ambiguity: When it is unclear what a word or phrase (like 'They' or 'The policy') refers to

Structural ambiguity: When grammatical structure allows for multiple interpretations (e.g., 'A and B at C' can be read as '(A and B) at C' or as 'A and (B at C)')

FActScore: A metric/framework that decomposes long-form generations into atomic claims to estimate factuality

NLI: Natural Language Inference, the task of determining whether a premise entails, contradicts, or is neutral toward a hypothesis