FactLens: Benchmarking Fine-Grained Fact Verification

📝 Paper Summary

Fact Verification, Hallucination Detection

FactLens improves fact-checking by decomposing complex claims into atomic sub-claims and evaluating them with new metrics like atomicity and sufficiency to pinpoint nuanced errors.

Core Problem

Traditional fact-verification models assign a single holistic label to complex claims, often obscuring nuanced errors or inaccuracies buried within multi-part statements.

Why it matters:

Holistic labels fail to identify exactly which part of a complex claim is incorrect, reducing interpretability
Existing benchmarks lack fine-grained labels, making it difficult to evaluate the precise failure modes of LLMs
Poorly constructed sub-claims (e.g., losing context or fabricating details) can degrade verification performance rather than help it

Concrete Example: For the claim 'Amanda Bauer attended the University of Cincinnati. The school’s nickname is Bearcats,' a poor decomposition might output 'The school’s nickname is Bearcats.' This sub-claim lacks sufficiency because the reference to 'University of Cincinnati' is lost, making it ambiguous for verification.

Key Novelty

FactLens: A Fine-Grained Verification Benchmark & Evaluator

Decomposes complex claims into atomic sub-claims to isolate factual errors, rather than verifying the whole sentence at once
Introduces specific quality metrics (Atomicity, Sufficiency, Fabrication, Coverage) to judge whether a sub-claim is well-formed before verification
Provides a manually curated dataset of 733 instances with ground-truth sub-claims to benchmark decomposition models

Architecture

Conceptual comparison between Holistic Verification and Fine-Grained Verification pipelines

Evaluation Highlights

FactLens automated evaluators achieve fair-to-moderate correlation with human judgments across quality dimensions (Pearson 0.45 for Atomicity and 0.60 for Coverage)
End-to-end analysis shows that sub-claims with 'low' fabrication scores yield higher downstream verification F1 than those with 'high' fabrication scores
Current SOTA models (GPT-4o, LLaMA-3.1) struggle with Atomicity, often failing to split claims with one subject and multiple objects

Breakthrough Assessment

7/10

Important step towards more granular and explainable fact-checking. The metrics and dataset are valuable contributions, though the reliance on LLMs for evaluation (despite calibration) is a known limitation.

⚙️ Technical Details

Problem Definition

Setting: Given a complex claim C and evidence E, decompose C into sub-claims {c_1, c_2, ..., c_n} and verify each c_i independently.

Inputs: Complex claim (text), Context/Evidence (text/tables)

Outputs: List of sub-claims, Quality scores for sub-claims, Verification label (True/False) per sub-claim
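
To make the setting concrete, here is a minimal Python sketch of per-sub-claim outputs and of how they can be collapsed back into a single holistic label. The class and function names are hypothetical, and the all-sub-claims-must-hold aggregation rule is an assumption for illustration, not something the paper is confirmed to use. The second sub-claim is deliberately altered from the summary's example so that one label is False.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubClaimResult:
    text: str    # one sub-claim c_i
    label: bool  # True if the verifier judged c_i supported by the evidence


def holistic_label(results: List[SubClaimResult]) -> bool:
    """Assumed aggregation: the complex claim holds only if every sub-claim holds."""
    return all(r.label for r in results)


def failing_sub_claims(results: List[SubClaimResult]) -> List[str]:
    """Return the sub-claims judged False, i.e. where the error sits."""
    return [r.text for r in results if not r.label]


# Illustrative labels only; the second sub-claim is intentionally altered.
results = [
    SubClaimResult("Amanda Bauer attended the University of Cincinnati.", True),
    SubClaimResult("The University of Cincinnati's nickname is Wildcats.", False),
]

print(holistic_label(results))
print(failing_sub_claims(results))
```

The holistic label alone only says the claim fails; the per-sub-claim labels say which part fails.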

Pipeline Flow

Claim Decomposition (LLM splits claim)
Sub-claim Evaluation (FactLens Evaluator checks quality)
Verification (Verifier checks sub-claims against evidence)
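
The three stages above can be sketched as a function that takes each stage as a pluggable callable. Everything below is a hypothetical illustration: `run_pipeline` and the `toy_*` stand-ins are not from the FactLens codebase, and the stand-ins use trivial string rules where the paper uses LLMs.

```python
from typing import Callable, Dict, List, Tuple

Decomposer = Callable[[str], List[str]]
Evaluator = Callable[[str, List[str]], Dict[str, int]]
Verifier = Callable[[str, str], bool]


def run_pipeline(claim: str, evidence: str, decompose: Decomposer,
                 evaluate: Evaluator, verify: Verifier) -> dict:
    sub_claims = decompose(claim)                # 1. claim decomposition
    quality = evaluate(claim, sub_claims)        # 2. sub-claim quality evaluation
    labels: List[Tuple[str, bool]] = [
        (c, verify(c, evidence)) for c in sub_claims  # 3. per-sub-claim verification
    ]
    return {"sub_claims": sub_claims, "quality": quality, "labels": labels}


def toy_decompose(claim: str) -> List[str]:
    # Naive sentence split; a real decomposer would be an LLM.
    return [s.strip().rstrip(".") + "." for s in claim.split(". ") if s.strip()]


def toy_evaluate(claim: str, sub_claims: List[str]) -> Dict[str, int]:
    # Placeholder 1-3 score: flag any sub-claim that still contains " and ".
    return {"atomicity": 1 if any(" and " in c for c in sub_claims) else 3}


def toy_verify(sub_claim: str, evidence: str) -> bool:
    # Crude substring match against the evidence; a real verifier would be an LLM.
    return sub_claim.rstrip(".").lower() in evidence.lower()


claim = ("Amanda Bauer attended the University of Cincinnati. "
         "The school's nickname is Bearcats.")
evidence = ("Amanda Bauer attended the University of Cincinnati, "
            "whose nickname is Bearcats.")
out = run_pipeline(claim, evidence, toy_decompose, toy_evaluate, toy_verify)
for sub_claim, label in out["labels"]:
    print(label, sub_claim)
```

With these stand-ins the context-dependent second sub-claim does not match the evidence, which loosely mirrors why a sub-claim that loses its referent is hard to verify on its own.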

System Modules

Claim Decomposer

Break down complex claims into smaller sub-claims

Model or implementation: GPT-4o or LLaMA-3.1-405B

FactLens Evaluator

Assess the quality of generated sub-claims

Model or implementation: Ensemble of LLM-based scoring and statistical/rule-based scoring

Verifier

Verify the truthfulness of each sub-claim against ground truth evidence

Model or implementation: GPT-4o-mini

Novel Architectural Elements

FactLens Evaluator module: Integrates dimension-specific quality checks (Atomicity, Sufficiency, etc.) directly into the pipeline to filter or judge decompositions before verification

Modeling

Base Model: GPT-4o and LLaMA-3.1-405B (for decomposition experiments)

Comparison to Prior Work

vs. CoverBench: Adds fine-grained sub-claim labels and specific metrics for decomposition quality
vs. Standard Fact-Checking: Shifts from single-label prediction to multi-label sub-claim verification to catch partial hallucinations

Limitations

Relies on LLM-as-a-judge for evaluation metrics, which can be inconsistent or biased
Statistical metrics depend on entity extraction accuracy, which is not perfect
Verification experiments use ground-truth evidence provided by datasets, skipping the retrieval step essential in real-world RAG

Reproducibility

Code: https://github.com/megagonlabs/factlens

📊 Experiments & Results

Evaluation Setup

Fine-grained verification of 733 complex claims from CoverBench

Benchmarks:

FactLens (derived from CoverBench) (Claim Decomposition and Verification) [New]

Metrics:

Atomicity (1-3 scale)
Sufficiency (1-3 scale)
Fabrication (1-3 scale)
Coverage (1-3 scale)
Redundancy (1-3 scale)
Readability (1-3 scale)
Pearson Correlation (Human vs. Automated Evaluator)
F1 score (Downstream verification performance)
Statistical methodology: Pearson Correlation reported for Human-Automated agreement
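
For reference, the Pearson correlation used for human-automated agreement can be computed as in the sketch below. This is the standard formula written in plain Python; the score lists are made-up placeholders on the 1-3 scale, not data from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder scores: hypothetical human and automated ratings per sub-claim.
human_scores = [1, 2, 3, 3, 2, 1]
auto_scores = [1, 3, 3, 2, 2, 1]
print(round(pearson(human_scores, auto_scores), 3))
```

Because the scale has only three levels, many ties are expected, and a constant rater makes the coefficient undefined, which the sketch surfaces as an error rather than returning 0.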

Key Results

Correlation analysis validates that FactLens automated evaluators align reasonably well with human judgments, though subjectivity in 'Sufficiency' lowers agreement.

Benchmark	Metric	Baseline	This Paper	Δ
Synthetic Dataset	Pearson Correlation (Atomicity)	0	0.45	+0.45
Synthetic Dataset	Pearson Correlation (Coverage)	0	0.60	+0.60
Synthetic Dataset	Pearson Correlation (Sufficiency)	0	0.27	+0.27

On decomposition, models score high on sufficiency and coverage but struggle significantly with atomicity.

Benchmark	Metric	Baseline	This Paper	Δ
FactLens (CoverBench subset)	Atomicity Score (1-3)	3.00	1.89	-1.11
FactLens (CoverBench subset)	Sufficiency Score (1-3)	3.00	2.98	-0.02

Experiment Figures

Bar charts showing downstream verification F1 scores grouped by the quality score (Low/Medium/High) of the generated sub-claims
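
The grouping behind such a chart can be sketched as follows: bucket each verified sub-claim by its quality level, then compute F1 per bucket. The record layout and function names are hypothetical, the records are made-up placeholders, and F1 is computed with True as the positive class, which is an assumption (the paper may use a different averaging).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def f1_score(gold: List[bool], pred: List[bool]) -> float:
    """Binary F1 with True as the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_bucket(records: Iterable[Tuple[str, bool, bool]]) -> Dict[str, float]:
    """records: (quality_bucket, gold_label, predicted_label) per sub-claim."""
    gold: Dict[str, List[bool]] = defaultdict(list)
    pred: Dict[str, List[bool]] = defaultdict(list)
    for bucket, g, p in records:
        gold[bucket].append(g)
        pred[bucket].append(p)
    return {b: f1_score(gold[b], pred[b]) for b in gold}


# Placeholder records, not results from the paper.
records = [
    ("low", True, True),
    ("low", False, False),
    ("high", True, False),
    ("high", True, True),
]
scores = f1_by_bucket(records)
print(scores)
```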

Main Takeaways

Sub-claim quality directly impacts verification accuracy: high fabrication in sub-claims correlates with lower F1 scores in downstream verification
Current LLMs (GPT-4o, LLaMA-3.1) preserve context well (Sufficiency) but often fail to make sub-claims truly atomic (Atomicity)
Fine-grained verification offers better transparency by isolating specific errors, even if it introduces the complexity of decomposition quality assurance

📚 Prerequisite Knowledge

Prerequisites

Understanding of Fact Verification (Fact-Checking) pipelines
Familiarity with LLM hallucination issues
Basic knowledge of Natural Language Inference (NLI)

Key Terms

Atomicity: A metric measuring whether a sub-claim represents a single factual unit (one relation between subject and object) rather than multiple facts.

Sufficiency: A metric measuring whether a sub-claim is unambiguous and retains enough context from the original claim to be verified independently.

Fabrication: A metric checking if the decomposition process introduced new, made-up information not present in the original claim.

Coverage: A metric assessing whether the list of sub-claims captures all factual assertions present in the original complex claim.

Redundancy: A metric checking if the generated sub-claims contain repetitive information, which wastes compute and skews error rates.

CoverBench: The source dataset (Jacovi et al., 2024) providing the original complex claims used to build FactLens.

GPT-4o-mini: The specific LLM used as the 'verifier' model in this paper to check the truthfulness of sub-claims against evidence.
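
As a rough illustration of how the Fabrication, Coverage, and Redundancy definitions above could be approximated with rule-based checks, the sketch below compares content-word sets. This is a simplified stand-in, not the FactLens scoring: the summary says the statistical metrics rely on entity extraction, whereas this sketch uses plain token overlap, and the stopword list, function names, and 0.8 threshold are all assumptions.

```python
import re
from typing import List, Set

# Tiny illustrative stopword list; "s" catches the possessive in "Cincinnati's".
STOPWORDS = {"the", "a", "an", "is", "was", "of", "in", "and", "to", "from", "s"}


def content_tokens(text: str) -> Set[str]:
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def fabricated_tokens(claim: str, sub_claim: str) -> Set[str]:
    """Content words in the sub-claim that never appear in the original claim."""
    return content_tokens(sub_claim) - content_tokens(claim)


def uncovered_tokens(claim: str, sub_claims: List[str]) -> Set[str]:
    """Content words of the claim that no sub-claim mentions."""
    covered = set().union(*(content_tokens(c) for c in sub_claims))
    return content_tokens(claim) - covered


def is_redundant(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two sub-claims whose content words overlap heavily (Jaccard similarity)."""
    ta, tb = content_tokens(a), content_tokens(b)
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


claim = ("Amanda Bauer attended the University of Cincinnati. "
         "The school's nickname is Bearcats.")
good = "The University of Cincinnati's nickname is Bearcats."
# Hypothetical sub-claim with invented details, for illustration only.
bad = "Amanda Bauer graduated from the University of Cincinnati in 2005."

print(fabricated_tokens(claim, good))
print(fabricated_tokens(claim, bad))
print(uncovered_tokens(claim, ["Amanda Bauer attended the University of Cincinnati."]))
```

Token overlap cannot detect a sub-claim that reuses the original words to assert a different relation, which is one reason to pair rule-based checks with LLM-based scoring as the evaluator does.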