VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

📝 Paper Summary

Long-form factuality evaluation · Fact extraction and verification · Hallucination detection

VeriFact improves long-form factuality evaluation by using LLMs to refine extracted facts (adding context and relations) and introduces FactRBench to measure both precision and recall using reference fact sets.

Core Problem

Existing evaluation pipelines (decompose-decontextualize-verify) often fail on long-form text because they extract incomplete facts lacking necessary context or miss facts representing inter-sentence relations.

Why it matters:

Incomplete facts (e.g., 'price could drop 50%') without conditions (e.g., 'if demand disappears') lead to incorrect verification verdicts.
Current methods focus almost exclusively on precision, neglecting recall (coverage of relevant facts), which is critical for comprehensive assessment.
Fixed-K metrics like F1@K are question-agnostic and may misrepresent factual coverage for complex queries.

Concrete Example: For a query about gold prices, the SAFE method extracts 'The price of gold could drop by 20-50%', losing the critical condition 'if the demand for gold as jewelry were to disappear'. Without that condition, the verifier labels the fact incorrectly. SAFE also misses the causal relation signaled by 'making', namely that this drop would make it similar to other metals.

Key Novelty

VeriFact (Verification of refined Facts) & FactRBench

Introduces a refinement step in the extraction pipeline where LLMs explicitly detect and repair 'incomplete facts' (missing context/conditions) and 'missing facts' (overlooked relational info).
Creates FactRBench, a benchmark with reference fact sets derived from human answers and aggregated LLM outputs, enabling the calculation of recall alongside precision.
Releases full web pages used for verification to ensure reproducibility, unlike prior benchmarks that rely on transient search results.

Architecture

The VeriFact pipeline workflow.

Evaluation Highlights

VeriFact reduces the extraction of incomplete facts by 19.2% compared to the best comparison method (SAFE).
The refinement stage reduces the number of missing facts by 37%, capturing more inter-sentence dependencies.
Ensemble LLM annotation achieves 0.89 recall for detecting incomplete facts and 0.85 recall for missing facts.

Breakthrough Assessment

8/10

Significantly refines the standard factuality pipeline by addressing the 'context loss' problem in atomic fact extraction and rigorously introducing recall metrics, which are often ignored in hallucination research.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of long-form text generation for factuality (correctness) and coverage (recall) against external knowledge

Inputs: User prompt/question and a long-form LLM response

Outputs: Set of verified atomic facts, Precision score, and Recall score against reference facts

Pipeline Flow

Decomposition (split response into atomic facts)
Detection (identify incomplete/missing facts using LLM judges)
Refinement (rewrite facts to add context; generate new facts for missing relations)
Verification (check facts against Google Search evidence)
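
The four stages above can be sketched as a chain of functions. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Fact`, `run_pipeline`, and the callables passed in) is hypothetical, and the LLM and search calls are replaced by plain Python callables supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Fact:
    text: str
    supported: Optional[bool] = None  # filled in by the verification stage


def run_pipeline(
    response: str,
    decompose: Callable[[str], List[str]],
    is_incomplete: Callable[[str, str], bool],
    rewrite: Callable[[str, str], str],
    find_missing: Callable[[str, List[str]], List[str]],
    verify: Callable[[str], bool],
) -> List[Fact]:
    # 1. Decomposition: split the response into initial atomic facts.
    facts = decompose(response)
    # 2. Detection and 3. Refinement: rewrite only the facts flagged as incomplete.
    refined = [rewrite(f, response) if is_incomplete(f, response) else f for f in facts]
    # 3. Refinement (continued): add facts for relations the decomposition missed.
    refined.extend(find_missing(response, refined))
    # 4. Verification: attach a verdict to every refined fact.
    return [Fact(text=f, supported=verify(f)) for f in refined]


# Toy stand-ins for the LLM and search components (assumptions for illustration only).
response = "If jewelry demand disappeared, the price of gold could drop by 20-50%."
facts = run_pipeline(
    response,
    decompose=lambda r: ["The price of gold could drop by 20-50%."],
    is_incomplete=lambda f, r: "if" not in f.lower(),
    rewrite=lambda f, r: f.rstrip(".") + " if jewelry demand disappeared.",
    find_missing=lambda r, fs: [],
    verify=lambda f: True,
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or search API; a real system would wrap LLM prompts and a search client behind the same signatures.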

System Modules

Fact Decomposer

Split long-form response into initial atomic facts

Model or implementation: Same methodology as SAFE (model not specified, likely GPT-4o)

Issue Detector (Refinement)

Identify incomplete facts (missing conditions/context) and missing relational facts

Model or implementation: Ensemble of GPT-4o, Llama 3.3-70B, Qwen 2.5-32B
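
This summary lists three judge models but does not state how their verdicts are combined. The sketch below assumes a simple vote threshold, which is only one plausible aggregation rule; `ensemble_flag` and the stand-in judges are hypothetical names, and each judge would wrap one LLM call in practice.

```python
from typing import Callable, Sequence


def ensemble_flag(
    fact: str,
    context: str,
    judges: Sequence[Callable[[str, str], bool]],
    min_votes: int = 2,
) -> bool:
    """Return True when at least `min_votes` judges flag the fact as incomplete."""
    votes = sum(1 for judge in judges if judge(fact, context))
    return votes >= min_votes


# Stand-in judges (assumptions); real judges would prompt an LLM with fact and context.
judges = [
    lambda fact, ctx: "could" in fact,          # flags hedged claims
    lambda fact, ctx: len(fact.split()) < 10,   # flags short claims
    lambda fact, ctx: False,                    # never flags
]
flagged = ensemble_flag("The price of gold could drop by 20-50%", "", judges)
```

Lowering `min_votes` to 1 takes the union of the judges' flags and favors detection recall; raising it to the number of judges requires unanimity and favors detection precision.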

Fact Refiner (Refinement)

Rewrite incomplete facts to be self-contained; generate new atomic facts for missing relations

Model or implementation: LLM (implied GPT-4o based on context)

Verifier

Verify correctness of refined facts against external evidence

Model or implementation: Llama 3.3-70B (Judge) + Google Search (Serper API)

Novel Architectural Elements

Reflection-based decontextualization loop: Explicitly detecting 'incomplete' and 'missing' facts using an ensemble of LLM judges before verification
Taxonomy-driven refinement: Uses specific categories (missing comparandum, omitted condition, temporal relation) to guide the rewriting process
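
One way to picture taxonomy-driven refinement is a prompt builder that selects rewriting guidance by issue category. The category names come from the list above; the guidance strings, `Issue`, `GUIDANCE`, and `build_refine_prompt` are hypothetical and do not reproduce the paper's prompts.

```python
from enum import Enum


class Issue(Enum):
    MISSING_COMPARANDUM = "missing comparandum"
    OMITTED_CONDITION = "omitted condition"
    TEMPORAL_RELATION = "temporal relation"


# Hypothetical per-category guidance; the paper's actual prompt text is not shown here.
GUIDANCE = {
    Issue.MISSING_COMPARANDUM: "State explicitly what the claim is being compared to.",
    Issue.OMITTED_CONDITION: "Restore the condition under which the claim holds.",
    Issue.TEMPORAL_RELATION: "State the time order between the events involved.",
}


def build_refine_prompt(fact: str, response: str, issue: Issue) -> str:
    """Assemble a rewriting prompt whose instruction depends on the detected issue."""
    return (
        f"Original response:\n{response}\n\n"
        f"Extracted fact: {fact}\n"
        f"Detected issue: {issue.value}\n"
        f"Instruction: {GUIDANCE[issue]} "
        "Rewrite the fact so it is self-contained, using only information from the response."
    )


prompt = build_refine_prompt(
    "The price of gold could drop by 20-50%.",
    "If jewelry demand disappeared, the price of gold could drop by 20-50%.",
    Issue.OMITTED_CONDITION,
)
```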

Modeling

Base Model: Evaluated models include GPT-4o, Claude 3.5 Sonnet, Llama 3.1, Mistral, Qwen, etc. The pipeline itself uses GPT-4o, Llama 3.3-70B, and Qwen 2.5-32B as judges.

Comparison to Prior Work

vs. SAFE: VeriFact adds a refinement stage to fix incomplete/missing facts before verification, addressing context loss.
vs. FactScore: VeriFact supports open-domain queries via Google Search (not just Wikipedia) and evaluates recall.
vs. F1@K (SAFE metric): VeriFact calculates recall against a dynamic set of reference facts rather than a fixed K.
vs. Gunjal and Durrett (2024) [cited]: VeriFact addresses non-entity incompleteness (conditions, relations), whereas prior work focused largely on entity ambiguity.
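
The difference between a fixed-K recall and a reference-based recall can be illustrated with a few lines of arithmetic. The fixed-K form below follows the usual F1@K definition, where recall saturates at 1 once K supported facts are found; the reference-based form divides by the size of a per-question reference set. Function names and the example counts are assumptions for illustration.

```python
def recall_at_k(num_supported: int, k: int) -> float:
    """Fixed-K recall: saturates at 1.0 once K supported facts are found."""
    return min(num_supported / k, 1.0)


def reference_recall(num_reference_covered: int, num_reference: int) -> float:
    """Recall against a per-question reference fact set."""
    return num_reference_covered / num_reference if num_reference else 0.0


# Two hypothetical responses, each containing 10 supported facts.
fixed = recall_at_k(num_supported=10, k=10)  # identical for both questions
narrow = reference_recall(num_reference_covered=5, num_reference=5)    # narrow question
broad = reference_recall(num_reference_covered=10, num_reference=40)   # broad question
```

With K fixed at 10, both responses receive the same recall even though the broad question has many more relevant facts left uncovered; the reference-based score separates the two cases.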

Limitations

Dependency on proprietary LLMs (GPT-4o) for parts of the pipeline and reference fact generation.
Google Search snippets may not always contain sufficient information for verification (though full pages are stored).
The recall metric relies on reference facts generated by other LLMs (for FactBench prompts), which might not be exhaustive.
Computational cost is higher than standard pipelines due to the ensemble detection and refinement steps.

Reproducibility

Code: https://huggingface.co/spaces/launch/factrbench

FactRBench is publicly available on Hugging Face. The benchmark includes 1096 prompts (from FactBench and Reddit), reference fact sets, and complete web pages retrieved during verification to ensure consistent future evaluation.

📊 Experiments & Results

Evaluation Setup

Evaluation of 12 frontier LLMs (open and closed weight) on FactRBench using the VeriFact pipeline.

Benchmarks:

FactRBench (Long-form Question Answering) [New]

Metrics:

Precision (percentage of generated facts that are supported)
Recall (percentage of reference facts covered by generated facts)
F1 Score (harmonic mean of Precision and Recall)
Statistical methodology: Not explicitly reported in the paper
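
The three metrics above can be computed from two lists of booleans: one verdict per generated fact and one coverage flag per reference fact. This is a minimal sketch for a single response; the function name is hypothetical, and how scores are aggregated across prompts is not specified in this summary.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    fact_supported: Sequence[bool],
    reference_covered: Sequence[bool],
) -> Tuple[float, float, float]:
    # Precision: share of generated facts that the verifier marked as supported.
    precision = sum(fact_supported) / len(fact_supported) if fact_supported else 0.0
    # Recall: share of reference facts covered by the generated facts.
    recall = sum(reference_covered) / len(reference_covered) if reference_covered else 0.0
    # F1: harmonic mean, defined as 0 when both precision and recall are 0.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# A cautious response: 3 of 4 generated facts supported, 2 of 5 reference facts covered.
p, r, f1 = precision_recall_f1(
    [True, True, True, False],
    [True, False, False, False, True],
)
```

Here precision is 0.75 while recall is only 0.4, the kind of gap that a precision-only evaluation would not reveal.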

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FactBench subset	Incomplete facts extracted (vs. SAFE)	Raw count not reported	Raw count not reported	-19.2%
FactBench subset	Missing facts (after refinement)	Raw count not reported	Raw count not reported	-37%
FactBench subset	Recall of incomplete-fact detection	Not reported	0.89	n/a (no baseline)

Experiment Figures

A motivating example comparing SAFE (existing method) vs. VeriFact.

Main Takeaways

Larger models within the same family (e.g., Llama 3.1 405B vs 8B) generally achieve better precision and recall.
High precision does not imply high recall; some models are cautious (high precision, low recall), emphasizing the need for both metrics.
Closed-weight models (GPT-4o, Claude 3.5) tend to have higher recall than open-weight models.
Large open-weight models (Llama 3.1-405B, Mistral-123B) are highly competitive with closed models, particularly in precision.

📚 Prerequisite Knowledge

Prerequisites

Understanding of the decompose-decontextualize-verify pipeline for fact-checking
Familiarity with atomic fact extraction from long text
Basic knowledge of Precision and Recall metrics

Key Terms

atomic claims: Individual, self-contained statements extracted from a longer text that can be independently verified

incomplete facts: Extracted claims that lack necessary context (e.g., conditions, comparanda) to be true or meaningful on their own

missing facts: Information present in the source text (often relational, like causality or timing) that is lost during the decomposition process

decontextualization: The process of rewriting a sentence fragment so it makes sense in isolation (e.g., resolving pronouns like 'he' to 'Barack Obama')

PDTB: Penn Discourse TreeBank—a corpus annotating discourse relations; used here to categorize missing relational facts like temporal or contingency connections

SAFE: Search-Augmented Factuality Evaluator—a baseline method that extracts and verifies facts using Google Search

FactCheck-GPT: A framework for checking LLM factuality that serves as the verification backbone for this paper

recall: In this context, the proportion of 'reference facts' (ground truth facts relevant to the query) that are successfully covered by the model's response