VeriScore: Evaluating the factuality of verifiable claims in long-form text generation

📝 Paper Summary

Factuality evaluation, Long-form text generation, Hallucination detection

VeriScore improves long-form factuality evaluation by extracting only verifiable atomic claims (filtering out subjective or unverifiable content) and verifying them against Google Search results using fine-tuned open-weight models.

Core Problem

Existing factuality metrics like FActScore and SAFE assume all generated text can be decomposed into verifiable atomic claims, leading to errors when texts contain subjective, unverifiable, or complex content.

Why it matters:

Current metrics extract unverifiable content (e.g., opinions, fictional stories) as claims, unfairly penalizing models when these cannot be verified
Metrics optimized for biographies (like FActScore) fail on diverse tasks (like LFQA) where claims are not atomic or contain complex inter-sentence dependencies
Relying solely on expensive closed-source models (GPT-4) for evaluation is costly and hinders reproducibility

Concrete Example: Given the generated sentence 'Betacyanin is like a superhero cape', existing metrics such as SAFE extract it verbatim as a factual claim to verify. Because the sentence is a metaphor and cannot be verified, the response receives a false penalty.

Key Novelty

VeriScore: Evaluating Verifiable Claims

Selectively extracts only 'verifiable claims' (statements describing specific events or states) rather than decomposing the entire text
Uses a sliding-window context approach during extraction to resolve pronouns and dependencies without a separate, expensive revision step
Provides a cost-effective implementation by fine-tuning open-weight models (Mixtral-8x22B, Llama-3) on high-quality data generated by GPT-4 and GPT-4o

Architecture

The VeriScore pipeline, which consists of claim extraction and verification phases.

Evaluation Highlights

Human annotators preferred VeriScore's claim extraction over that of SAFE in 93% of comparisons across diverse tasks
VeriScore with Llama-3-70B achieves 0.77 Spearman correlation with human judgments on claim extraction (comparable to GPT-4's 0.79)
GPT-4o achieves the highest average VeriScore (65.8) across 8 diverse datasets, while open-weight Mixtral-8x22B (60.9) is closing the gap with closed models

Breakthrough Assessment

8/10

Addresses a critical flaw in current factuality metrics (the assumption that all text is verifiable). The shift to 'verifiable claims' and the release of fine-tuned open-weight evaluators make this a practical and methodologically sound contribution.

⚙️ Technical Details

Problem Definition

Setting: Evaluating the factuality of long-form text generation across diverse domains

Inputs: A prompt x and a model response r

Outputs: A VeriScore (F1@K score) indicating the factual precision and recall of the response

Pipeline Flow

Claim Extraction (Identify verifiable claims)
Evidence Retrieval (Google Search)
Claim Verification (Classify as Supported/Unsupported)
Score Calculation (F1@K)
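The four stages above can be sketched as a simple loop over extracted claims. This is a minimal illustration, not the released implementation: `run_pipeline`, `ClaimResult`, and the three `toy_*` functions are hypothetical names, and the toy stand-ins replace the fine-tuned extractor, the Google Search call, and the verifier model. Only the supported/total counts are produced here; they are the inputs to the F1@K scoring step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClaimResult:
    claim: str
    evidence: List[str]
    supported: bool


def run_pipeline(
    response: str,
    extract: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
) -> List[ClaimResult]:
    """Extract claims, fetch evidence for each, and label each claim."""
    results = []
    for claim in extract(response):
        evidence = retrieve(claim)
        results.append(ClaimResult(claim, evidence, verify(claim, evidence)))
    return results


# Toy stand-ins (assumptions for illustration only).
KNOWN_FACTS = {"Paris is the capital of France"}


def toy_extract(text: str) -> List[str]:
    # Treat every sentence as a claim; a real extractor would filter
    # out unverifiable sentences.
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_retrieve(claim: str) -> List[str]:
    # Stand-in for a search call: return matching known facts.
    return [fact for fact in KNOWN_FACTS if fact == claim]


def toy_verify(claim: str, evidence: List[str]) -> bool:
    # Supported only if the evidence contains the claim itself.
    return claim in evidence


results = run_pipeline(
    "Paris is the capital of France. Paris has ten moons.",
    toy_extract,
    toy_retrieve,
    toy_verify,
)
supported = sum(r.supported for r in results)
total = len(results)
```

Passing the three stages as callables keeps the loop independent of whether a closed or an open-weight model is plugged in.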

System Modules

Claim Extractor

Decompose text into a list of verifiable claims, filtering out unverifiable content

Model or implementation: Fine-tuned Llama-3-8B-Instruct (or GPT-4/GPT-4o in closed setting)

Evidence Retriever

Retrieve relevant external knowledge for verification

Model or implementation: Google Search via Serper API

Claim Verifier

Determine if the evidence supports the claim

Model or implementation: Fine-tuned Llama-3-70B-Instruct (or GPT-4o in closed setting)

Novel Architectural Elements

Sliding window context for extraction: Includes previous and next sentences during extraction to resolve pronouns immediately, eliminating the separate 'claim revision' step used in SAFE
Verifiability filtering: Explicitly trains the extractor to output ONLY verifiable events/states, discarding subjective/unverifiable sentences that confuse prior metrics
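The sliding-window idea can be illustrated with a small helper that pairs each target sentence with its neighbours. This is a sketch under stated assumptions: the function name `build_windows` and the window sizes `before` and `after` are hypothetical, since this summary does not give the exact sizes or prompt format.

```python
from typing import List, Tuple


def build_windows(
    sentences: List[str], before: int = 2, after: int = 1
) -> List[Tuple[str, str]]:
    """Return (target sentence, context) pairs.

    The context joins up to `before` preceding sentences, the target,
    and up to `after` following sentences, so an extractor can resolve
    pronouns in the target without a separate revision step.
    """
    windows = []
    for i, target in enumerate(sentences):
        start = max(0, i - before)
        end = min(len(sentences), i + after + 1)
        context = " ".join(sentences[start:end])
        windows.append((target, context))
    return windows


sentences = [
    "Marie Curie was born in Warsaw.",
    "She won two Nobel Prizes.",
    "Her work was on radioactivity.",
]
windows = build_windows(sentences, before=1, after=1)
```

For the second sentence, the context includes "Marie Curie was born in Warsaw.", which is what lets an extractor rewrite "She" as "Marie Curie" in the emitted claim.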

Modeling

Base Model: Llama-3-8B-Instruct (Extractor) and Llama-3-70B-Instruct (Verifier)

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning (implied by context of open-weight release)

Trainable Parameters: Not reported in the paper

Training Data:

Extractor data: 5,423 extraction examples generated by GPT-4 (500-1000 per domain)
Verifier data: 4,000 extracted claims labeled by GPT-4o (500 per domain)

Key Hyperparameters:

learning_rate: 2e-5 (Llama-3-8B), 1e-5 (Llama-3-70B)
batch_size: 128
epochs: 2 (Llama-3-8B), 1 (Llama-3-70B)
weight_decay: 0.01

Compute: Extractor fine-tuned on 4x A100 (80G) for 20 mins; Verifier fine-tuned on 8x A100 (80G) for 50 mins

Comparison to Prior Work

vs. FActScore: Generalizes beyond biographies; handles pronoun resolution via sliding window rather than failing on them
vs. SAFE: Extracts only verifiable claims (filtering out subjective content) rather than extracting everything; replaces SAFE's expensive revision and relevance steps with context-aware extraction
vs. FacTool [not cited in paper]: VeriScore focuses on atomic claim decomposition and search, whereas FacTool often verifies larger claim units or uses different tool chains

Limitations

Dependency on Google Search means verification can fail if relevant documents are hard to retrieve via simple queries
Metric validity depends on the quality of the upstream extraction/verification models (though fine-tuned models show high correlation)
Combining 'Contradicted' and 'Inconclusive' into 'Unsupported' loses granularity in error analysis
Evaluation is limited to English language tasks
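The granularity loss from merging 'Contradicted' and 'Inconclusive' can be seen in a small example. The label strings and the sample list below are illustrative assumptions, not data from the paper.

```python
from collections import Counter

# Three-way labels collapsed to the binary scheme described above.
COLLAPSE = {
    "supported": "supported",
    "contradicted": "unsupported",
    "inconclusive": "unsupported",
}

fine_labels = [
    "supported",
    "contradicted",
    "inconclusive",
    "inconclusive",
    "supported",
]
binary_labels = [COLLAPSE[label] for label in fine_labels]

fine_counts = Counter(fine_labels)
binary_counts = Counter(binary_labels)
```

After collapsing, the one contradicted claim and the two inconclusive claims are indistinguishable, so an error analysis can no longer separate claims that were refuted from claims for which no evidence was found.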

Reproducibility

Code: https://github.com/Yixiao-Song/VeriScore

Data, including the training data distilled from GPT-4/GPT-4o, and the fine-tuned open-weight models are released in the same repository.

📊 Experiments & Results

Evaluation Setup

Evaluation of 16 LLMs on 8 diverse long-form generation datasets using VeriScore

Benchmarks:

Biographies (Biography generation)
FreshBooks (Long-form QA (factual))
Koala (Instruction following / Assistant queries)
LFQA (Long-form Question Answering)
Mix-Instruct (Instruction following)
ShareGPT (User-shared conversations)
Spheres (Scientific knowledge generation)
WritingPrompts (Creative writing)

Metrics:

VeriScore (F1@K)
Human preference (win rate)
Spearman correlation
Statistical methodology: Fleiss' kappa for inter-annotator agreement; Spearman correlation for metric-human agreement
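As a reminder of how Spearman correlation is computed, the sketch below ranks two score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the sample numbers are illustrative assumptions; the summary does not specify the exact quantities the paper correlates, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-response scores from humans and from an automatic metric.
human_scores = [3, 5, 2, 8, 6]
metric_scores = [4, 5, 1, 9, 5]
rho = spearman(human_scores, metric_scores)
```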

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Human evaluation confirms VeriScore's extraction quality significantly outperforms SAFE.
15 random texts across domains	Win Rate (Preference, %)	7.2	92.8	+85.6
Fine-tuned open-weight models achieve high correlation with GPT-4/Human judgments, validating the cost-effective pipeline.
VeriScore Pipeline	Spearman Correlation (Extraction)	0.58	0.77	+0.19
VeriScore Pipeline	F1 Score (Verification)	73.9	83.6	+9.7
Benchmarking 16 LLMs shows GPT-4o dominance and open-weight models closing the gap.
Average across 8 datasets	VeriScore	59.3	65.8	+6.5

Experiment Figures

Human preference win-rates comparing VeriScore vs SAFE claim extraction across different domains.

Main Takeaways

Factuality varies significantly by domain: a model's performance on biographies does not predict its performance on LFQA, necessitating multi-task evaluation.
Creative writing tasks (WritingPrompts) have extremely low verifiable claim density (ratio 0.03), validating the need for metrics that filter unverifiable content.
Open-weight models like Mixtral-8x22B are becoming competitive with proprietary models (GPT-4) in generating factual content.
VeriScore's extraction method resolves the 'atomic' claim overlap issue found in SAFE, where multiple claims restate the same information with slight variations.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Familiarity with atomic claim decomposition
Knowledge of retrieval-augmented verification (using search engines)

Key Terms

FActScore: A metric that decomposes text into atomic facts and verifies them; originally designed for biography generation

SAFE: Search-Augmented Factuality Evaluator—a metric that uses LLMs to decompose text and verify claims using Google Search

atomic claims: Short statements containing a single piece of information, used as the unit of verification in factuality metrics

verifiable claims: Claims describing a single event or state with necessary modifiers that can plausibly be proven true or false, excluding subjective opinions or metaphors

LFQA: Long-Form Question Answering—tasks requiring detailed, multi-sentence responses

F1@K: A metric balancing factual precision (supported claims / total extracted claims) and recall (supported claims / K, capped at 1), where K is the median number of extracted claims across model responses
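The F1@K definition can be made concrete with a short function. This is a sketch, not the released scoring code: the name `f1_at_k` is hypothetical, and capping recall at 1 and returning 0 when nothing is supported are assumptions that follow the F1@K formulation used by SAFE. The claim counts in the example are made up.

```python
from statistics import median


def f1_at_k(supported: int, total: int, k: float) -> float:
    """Harmonic mean of factual precision and recall-at-K.

    precision = supported / total
    recall    = min(supported / k, 1)   # cap is an assumption (see lead-in)
    """
    if supported == 0 or total == 0:
        return 0.0
    precision = supported / total
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# K as the median number of extracted claims over a set of responses.
claims_per_response = [8, 12, 10]
k = median(claims_per_response)

# A response with 8 extracted claims, 6 of them supported:
# precision = 6/8, recall = 6/10.
score = f1_at_k(supported=6, total=8, k=k)
```

Because recall is measured against K rather than against the response's own claim count, a short response with a few correct claims cannot reach a high score on precision alone.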

open-weight models: Models whose weights are publicly released (e.g., Llama-3, Mixtral), allowing local execution and fine-tuning

sliding window: A technique using surrounding sentences as context during extraction to resolve references (like pronouns) without rewriting

Spearman correlation: A statistical measure of rank correlation, used here to compare automatic metric rankings with human judgment rankings