VeriFastScore: Speeding up long-form factuality evaluation

📝 Paper Summary

Long-form factuality evaluation
Automated evaluation metrics

VeriFastScore accelerates long-form factuality evaluation by fine-tuning a single Llama-3.1-8B-Instruct model to extract and verify claims in one pass against consolidated evidence, replacing slow multi-step pipelines.

Core Problem

Existing long-form factuality metrics like VeriScore require a slow, multi-stage pipeline (claim decomposition → per-claim retrieval → verification), often incurring ~60 LLM/API calls per response.

Why it matters:

High latency (~100 seconds per response) makes current metrics impractical for real-time evaluation or large-scale benchmarks
Excessive API costs limit the use of factuality metrics as reward signals for reinforcement learning (e.g., RLHF)
Standard few-shot prompting of closed models (e.g., GPT-4o) fails at this complex task, achieving low correlation (r=0.33) with VeriScore's outputs

Concrete Example: A 14-sentence response typically yields ~23 claims. VeriScore triggers 14 extraction calls, 23 Google searches, and 23 verification calls. VeriFastScore replaces this with 1 search step and 1 model inference pass.
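The call counts in this example can be tallied directly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and it assumes one extraction call per sentence plus one search and one verification call per claim, as the example describes.

```python
def veriscore_calls(num_sentences: int, num_claims: int) -> int:
    """Calls made by a decompose -> retrieve -> verify pipeline.

    Assumes one extraction call per sentence, then one search and one
    verification call per extracted claim.
    """
    extraction = num_sentences
    searches = num_claims
    verification = num_claims
    return extraction + searches + verification


def verifastscore_steps() -> int:
    """One search step plus one model inference pass."""
    return 1 + 1


print(veriscore_calls(14, 23))  # 14 + 23 + 23 = 60
print(verifastscore_steps())    # 2
```

The total of 60 matches the ~60 LLM/API calls per response cited for VeriScore.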

Key Novelty

Single-Pass Decompose-and-Verify Evaluator

Replaces the sequential pipeline of extracting claims then verifying them individually with a single model pass that does both simultaneously using consolidated evidence
Retrieves evidence with full-sentence queries rather than atomic-claim queries, gathering evidence before decomposition so that the model verifies claims in the context of the search results
Trains on high-quality synthetic data generated by the slower, rigorous VeriScore pipeline to distill its capability into a faster, open-weights model

Architecture

Comparison of the VeriScore pipeline vs. the VeriFastScore pipeline.

Evaluation Highlights

Achieves 0.80 Pearson correlation with the rigorous VeriScore pipeline, significantly outperforming GPT-4o few-shot (0.33 correlation)
Delivers a 6.64x overall wall-clock speedup (9.9x modeling speedup) compared to the original VeriScore pipeline
Maintains strong system-level correlation (r=0.94) with VeriScore rankings, ensuring reliable model comparison at a fraction of the cost

Breakthrough Assessment

8/10

Significantly reduces the cost/time barrier for high-quality factuality evaluation (about 6.6x wall-clock speedup) while maintaining high correlation with VeriScore. Distilling a slow pipeline into a fast single model is a practical engineering breakthrough.

⚙️ Technical Details

Problem Definition

Setting: Given a model response r and retrieved evidence context E, output a set of verifiable claims C, each with a binary verification label (Supported/Unsupported).

Inputs: Full model response r (long-form text) and consolidated evidence snippets E (concatenated search results)

Outputs: List of (claim, label) pairs
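One way to represent this output is sketched below. The class and function names are hypothetical, the two example claims are made up for illustration, and the helper simply computes the share of claims labeled Supported.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    SUPPORTED = "Supported"
    UNSUPPORTED = "Unsupported"


@dataclass(frozen=True)
class VerifiedClaim:
    claim: str
    label: Label


def supported_fraction(claims: list[VerifiedClaim]) -> float:
    """Share of extracted claims labeled Supported (0.0 if none were extracted)."""
    if not claims:
        return 0.0
    supported = sum(1 for c in claims if c.label is Label.SUPPORTED)
    return supported / len(claims)


# Illustrative evaluator output for a two-claim response.
output = [
    VerifiedClaim("Paris is the capital of France.", Label.SUPPORTED),
    VerifiedClaim("The Eiffel Tower was completed in 1925.", Label.UNSUPPORTED),
]
print(supported_fraction(output))  # 0.5
```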

Pipeline Flow

Evidence Retrieval (Sentence-level queries)
Consolidation
Joint Extraction & Verification (Single Inference Pass)
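The three steps above can be sketched as a single function. This is an illustration under stated assumptions, not the authors' implementation: the search backend and the evaluator model are passed in as callables, the sentence splitter is a naive regex stand-in, and all names are hypothetical.

```python
import re
from typing import Callable

SearchFn = Callable[[str], list[str]]  # query -> evidence snippets
EvaluatorFn = Callable[[str, str], list[tuple[str, str]]]  # (response, evidence) -> (claim, label) pairs


def split_sentences(text: str) -> list[str]:
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def evaluate_response(
    response: str, search: SearchFn, evaluator: EvaluatorFn
) -> list[tuple[str, str]]:
    # 1. Evidence retrieval: one query per sentence, before any claim extraction.
    snippets: list[str] = []
    for sentence in split_sentences(response):
        snippets.extend(search(sentence))
    # 2. Consolidation: join all snippets into one evidence context.
    evidence = "\n".join(snippets)
    # 3. Joint extraction and verification: a single evaluator call.
    return evaluator(response, evidence)


# Stubs standing in for the search API and the fine-tuned model.
seen_evidence: list[str] = []


def stub_search(query: str) -> list[str]:
    return [f"snippet about: {query}"]


def stub_evaluator(response: str, evidence: str) -> list[tuple[str, str]]:
    seen_evidence.append(evidence)
    return [("toy claim", "Supported")]


result = evaluate_response("Cats are mammals. They purr.", stub_search, stub_evaluator)
print(result)
```

Injecting the search and evaluator callables keeps the sketch independent of any particular search API or model-serving stack.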

System Modules

Evidence Retriever

Fetch relevant web snippets for the response without prior claim extraction

Model or implementation: Google Search via SERPER API

Evidence Consolidator

Combine retrieved snippets into a single context window

Model or implementation: Concatenation logic

Evaluator Model

Simultaneously extract verifiable claims and label them as Supported/Unsupported based on context

Model or implementation: Llama-3.1-8B-Instruct (fine-tuned)
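The Evidence Consolidator is described only as concatenation logic. A minimal sketch of that step follows; the order-preserving de-duplication and the blank-line separator are assumptions added for illustration and are not confirmed by the summary.

```python
def consolidate(snippets: list[str], separator: str = "\n\n") -> str:
    """Concatenate snippets into one context string.

    Assumption: snippets that are identical after whitespace and case
    normalization are kept once, in first-seen order.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for snippet in snippets:
        key = " ".join(snippet.split()).lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(snippet.strip())
    return separator.join(kept)


context = consolidate(["Snippet one.", "snippet  one.", "", "Snippet two."])
print(context)
```

Because the same sentence-level search results can recur across queries, dropping repeats is one plausible way to keep the evidence context short.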

Novel Architectural Elements

Inverted pipeline order: Retrieval happens *before* claim decomposition (using sentences as proxies), whereas standard pipelines decompose first then retrieve specific evidence
Single-pass joint extraction and verification: Replaces two separate LLM calls (extractor + verifier) with one generation step

Modeling

Base Model: Llama-3.1-8B-Instruct

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

Source prompts: Tulu3 Personas dataset (filtered for factuality)
Synthetic Labels: Generated by running the VeriScore pipeline (decomposition + verification)
Dataset size: ~9K prompt-response pairs
Two-stage training: First on claim-level evidence (from VeriScore), then mixed with sentence-level evidence
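To illustrate how teacher outputs could become supervised fine-tuning examples, here is a hedged sketch. The prompt layout, the JSON-lines target format, and the function name are all hypothetical; the summary does not specify the actual training format.

```python
import json


def build_sft_example(
    response: str, evidence: str, teacher_claims: list[tuple[str, str]]
) -> dict[str, str]:
    """Pair (response, evidence) with the teacher pipeline's labeled claims.

    Hypothetical format: the target is one JSON object per line.
    """
    prompt = (
        f"### Response\n{response}\n\n"
        f"### Evidence\n{evidence}\n\n"
        "### Claims\n"
    )
    completion = "\n".join(
        json.dumps({"claim": claim, "label": label})
        for claim, label in teacher_claims
    )
    return {"prompt": prompt, "completion": completion}


example = build_sft_example(
    response="Cats are mammals.",
    evidence="Cats are small carnivorous mammals.",
    teacher_claims=[("Cats are mammals.", "Supported")],
)
print(example["completion"])
```

Under this framing, the two training stages would differ only in which evidence string is passed in: claim-level evidence first, then a mix that includes sentence-level evidence.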

Key Hyperparameters:

base_model: Llama-3.1-8B-Instruct
evidence_context_length: ~4000 tokens (average)

Compute: End-to-end evaluation takes ~15 seconds per response (vs. ~100 seconds for VeriScore), of which ~6 seconds is model inference. Trained on Lambda Labs infrastructure (exact GPU count not reported).

Comparison to Prior Work

vs. VeriScore: 6.6x faster wall-clock time; single model call vs ~60 calls; retrieves using sentences instead of claims
vs. FactScore: VeriFastScore is a distilled evaluator model rather than a pipeline framework; operates on verifiable claims only
vs. SAFE: Uses an open-weights model (Llama-3.1-8B-Instruct) instead of closed API models; single-pass vs multi-step prompting

Limitations

Evidence retrieval using full sentences may result in noisier or less precise context compared to claim-specific queries
Dependent on the quality of the teacher pipeline (VeriScore) for training data; inherits its errors
Evaluated primarily on English general-domain data; multilingual performance not tested

Reproducibility

Code: https://github.com/RishanthRajendhran/VeriFastScore

Publicly available: Code and synthetic datasets at https://github.com/RishanthRajendhran/VeriFastScore. Missing: Exact training GPU hours.

📊 Experiments & Results

Evaluation Setup

Evaluated on its ability to replicate VeriScore's factuality ratings on held-out Tulu3 Personas responses

Benchmarks:

Tulu3 Personas (Held-out) (Long-form generation / Factuality Evaluation)

Metrics:

Pearson Correlation (r) with VeriScore
Claim Precision/Recall (Paraphrase-aware)
Claim Accuracy (Correct Label)
Wall-clock runtime
Statistical methodology: Not explicitly reported in the paper
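The primary metric above, Pearson correlation with VeriScore, can be computed as in the sketch below. The function is a plain-Python illustration, and the score lists are made-up values, not results from the paper.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-response factuality scores from the two evaluators.
veriscore_scores = [0.90, 0.50, 0.70, 0.20]
verifastscore_scores = [0.80, 0.40, 0.75, 0.30]
print(round(pearson_r(veriscore_scores, verifastscore_scores), 3))
```

Example-level correlation compares scores response by response, while system-level correlation would apply the same function to per-model average scores.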

Key Results

VeriFastScore demonstrates high correlation with the expensive VeriScore pipeline, significantly outperforming zero/few-shot prompting approaches.

Benchmark	Metric	Baseline	This Paper	Δ
Tulu3 Personas	Pearson r (Example-level)	0.33	0.80	+0.47
Tulu3 Personas	Pearson r (System-level)	1.00	0.94	-0.06

Efficiency benchmarks show massive speedups in both modeling and total pipeline time.

Benchmark	Metric	Baseline	This Paper	Δ
Runtime Analysis	Total Wall-clock Time (seconds/response)	99.96	15.06	-84.90
Runtime Analysis	Modeling Latency (seconds/response)	59.98	6.06	-53.92
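The reported speedup factors follow from the runtime rows above. The short check below uses only the table's numbers; the helper name is hypothetical, and the "remaining time" figure is simply total minus modeling time.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new system is than the baseline."""
    return baseline_seconds / new_seconds


total = speedup(99.96, 15.06)
modeling = speedup(59.98, 6.06)
print(f"{total:.2f}x")     # 6.64x
print(f"{modeling:.1f}x")  # 9.9x

# Time spent outside modeling (retrieval and other overhead), per response.
baseline_rest = round(99.96 - 59.98, 2)
new_rest = round(15.06 - 6.06, 2)
print(baseline_rest, new_rest)  # 39.98 9.0
```

The overall speedup is smaller than the modeling speedup because the non-modeling time shrinks by a smaller factor.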

Main Takeaways

Simultaneous decomposition and verification is too complex for standard prompting (GPT-4o r=0.33) but solvable via fine-tuning (VeriFastScore r=0.80).
Retrieving evidence using sentences (proxy queries) works surprisingly well, despite lacking the precision of atomic claim queries.
Fine-tuning on mixed evidence (claim-level + sentence-level) improves robustness compared to training on sentence-level evidence alone.
Evaluation using exact matching underestimates performance; LLM-as-a-judge (paraphrase-aware) evaluation reveals much higher accuracy (71.6% vs 23.7%).
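The gap between exact matching and paraphrase-aware matching can be illustrated with a pluggable matcher. In this sketch the token-set comparison is only a crude stand-in for an LLM judge, and the function names and example claims are hypothetical.

```python
import re
from typing import Callable

Matcher = Callable[[str, str], bool]


def exact_match(a: str, b: str) -> bool:
    return a == b


def _tokens(text: str) -> frozenset[str]:
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def loose_match(a: str, b: str) -> bool:
    """Stand-in for a paraphrase-aware judge: same set of lowercase word tokens."""
    return _tokens(a) == _tokens(b)


def claim_precision_recall(
    predicted: list[str], reference: list[str], matcher: Matcher
) -> tuple[float, float]:
    """Precision: predicted claims with a match; recall: reference claims with a match."""
    hit_pred = sum(any(matcher(p, r) for r in reference) for p in predicted)
    hit_ref = sum(any(matcher(p, r) for p in predicted) for r in reference)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_ref / len(reference) if reference else 0.0
    return precision, recall


predicted = ["Marie Curie won two Nobel Prizes.", "She was born in Warsaw."]
reference = ["marie curie won two nobel prizes", "She was born in Warsaw."]
print(claim_precision_recall(predicted, reference, exact_match))  # (0.5, 0.5)
print(claim_precision_recall(predicted, reference, loose_match))  # (1.0, 1.0)
```

Even trivial surface differences such as casing and punctuation defeat exact matching, which is consistent with the large gap reported between the two evaluation modes.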

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Familiarity with atomic claim decomposition
Knowledge of LLM fine-tuning and synthetic data generation

Key Terms

VeriScore: A high-quality but slow factuality evaluation pipeline that decomposes text into atomic claims and verifies them individually against Google Search results

atomic claims: Short, single-fact statements extracted from a longer text, containing one event or state

factual precision: The proportion of verifiable claims in a response that are supported by evidence

factual recall: The number of supported claims relative to K, the median number of claims across responses, capped at 1 (penalizing responses that contain too few supported claims)

F1@K: The harmonic mean of factual precision and factual recall, used to score model factuality

SERPER API: A search engine API (Google Search wrapper) used to retrieve evidence snippets

Pearson correlation: A statistical measure (r) of linear correlation between two sets of data (here, agreement between evaluator scores)

distillation: The process of training a smaller or faster model (student) to replicate the behavior of a larger or more complex system (teacher)
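The factual precision, factual recall, and F1@K terms defined above can be combined as in the sketch below. It assumes the F1@K form introduced with SAFE, in which recall is capped at 1 and the score is 0 when no claim is supported; the function name and the example counts are illustrative.

```python
def f1_at_k(num_supported: int, num_claims: int, k: int) -> float:
    """F1@K from claim counts.

    precision = supported / total claims
    recall    = min(supported / K, 1)   (assumed cap at 1)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if num_supported == 0 or num_claims == 0:
        return 0.0
    precision = num_supported / num_claims
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# 8 of 10 claims supported, with K = 10.
print(round(f1_at_k(8, 10, 10), 3))   # 0.8
# 20 of 20 claims supported: recall is capped at 1.
print(round(f1_at_k(20, 20, 10), 3))  # 1.0
```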