FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs

📝 Paper Summary

Long-form factuality evaluation
Hallucination detection

FaStfact improves long-form factuality evaluation by using dynamic chunking and confidence-based pre-verification to reduce costs, while scraping full web pages instead of snippets to ensure sufficient evidence.

Core Problem

Existing long-form factuality evaluators are inefficient due to redundant sentence-level processing and ineffective due to insufficient evidence (short snippets) and inaccurate claim extraction.

Why it matters:

Inefficient pipelines (high time/token costs) cannot scale to evaluate long documents generated by modern LLMs
Existing methods frequently produce unverifiable, redundant, or missing claims because sentence-level processing misses global context
Reliance on short search snippets (20-40 words) often leads to 'inconclusive' verification even when ample evidence exists online

Concrete Example: In a case study of SAFE, 68% of claims extracted from a GPT-3.5 response were problematic (redundant or unverifiable). Furthermore, verifiers often lack context because they only see short Google search snippets, leading to false negatives where true claims are labeled 'not enough evidence'.

Key Novelty

FaStfact (Chunk-based Extraction + Pre-verification + Full-Page Evidence)

Replaces sentence-level extraction with dynamic chunking to process larger contexts at once, reducing inference calls and capturing inter-sentence dependencies
Introduces confidence-based pre-verification where the LLM verifies 'easy' claims using internal knowledge, skipping external search if confidence is high
Fetches full web page content instead of short search snippets to create a comprehensive document-level knowledge base for 'hard' claims requiring external verification

Evaluation Highlights

Achieves the highest alignment with human evaluation among the compared methods, including FActScore and SAFE, on the new FaStfact-Bench
Significantly reduces processing time and token costs compared to SAFE and FActScore due to dynamic chunking and pre-verification
Reduces the rate of inconclusive verifications by providing full document-level evidence rather than truncated snippets

Breakthrough Assessment

8/10

Strong engineering contribution that fixes major efficiency bottlenecks in the standard decompose-then-verify pipeline while simultaneously improving evidence quality. The release of a fine-grained annotated benchmark is also valuable.

⚙️ Technical Details

Problem Definition

Setting: Evaluating the factuality of long-form text generated by an LLM in response to an open-ended question

Inputs: Prompt/Question x, Generated Response y

Outputs: Factuality score (percentage of supported claims)
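
As a minimal sketch of the output metric, assuming every claim ends up with a single verdict string: the score is the share of claims labeled as supported. The label names and the handling of an empty claim list are assumptions for illustration; the source only states that the score is the percentage of supported claims.

```python
def factuality_score(labels):
    """Return the percentage of claims labeled 'supported'.

    `labels` is a list of per-claim verdict strings. The label vocabulary
    is hypothetical; only 'supported' counts toward the score.
    """
    if not labels:
        return 0.0  # assumption: a response with no claims scores zero
    supported = sum(1 for label in labels if label == "supported")
    return 100.0 * supported / len(labels)


verdicts = ["supported", "supported", "unsupported", "inconclusive"]
print(factuality_score(verdicts))  # 2 of 4 claims supported -> 50.0
```

Whether inconclusive claims belong in the denominator is a design choice; this sketch counts them, which penalizes responses whose claims cannot be verified.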

Pipeline Flow

Group 1: Extraction & Pre-verification (Dynamic Chunking → Extraction → Internal Check)
Group 2: Evidence Gathering (Web Search → Full Page Scraping → BM25 Retrieval)
Group 3: Verification & Scoring (Evidence-based Verification → Aggregation)
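
The three groups can be read as a single control loop in which only low-confidence claims reach the search stage. The sketch below is one possible arrangement, not the authors' code: every callable passed in (`extract`, `search_and_scrape`, `retrieve`, `verify`, `chunker`) is a hypothetical stand-in, and the `0.9` threshold is an arbitrary illustrative value.

```python
def evaluate_response(response, chunker, extract, search_and_scrape,
                      retrieve, verify, confidence_threshold=0.9):
    """Run the three pipeline groups; every callable is a hypothetical stub.

    chunker(response)           -> list of text chunks
    extract(chunk)              -> list of (claim, label, confidence)
    search_and_scrape(claim)    -> list of full-page documents
    retrieve(claim, documents)  -> list of evidence passages
    verify(claim, evidence)     -> label
    """
    labels = []
    for chunk in chunker(response):
        # Group 1: extraction and pre-verification, one call per chunk.
        for claim, label, confidence in extract(chunk):
            if confidence < confidence_threshold:
                # Group 2: gather full-page evidence for uncertain claims only.
                documents = search_and_scrape(claim)
                evidence = retrieve(claim, documents)
                # Group 3: verify the claim against the retrieved evidence.
                label = verify(claim, evidence)
            labels.append(label)
    supported = sum(1 for label in labels if label == "supported")
    return 100.0 * supported / len(labels) if labels else 0.0


# Toy stand-ins so the sketch runs end to end without any API calls.
searched = []

def toy_search(claim):
    searched.append(claim)
    return ["full page text"]

def toy_extract(chunk):
    return [("Claim A", "supported", 0.99), ("Claim B", "supported", 0.40)]

score = evaluate_response(
    "some response",
    chunker=lambda text: [text],
    extract=toy_extract,
    search_and_scrape=toy_search,
    retrieve=lambda claim, documents: documents,
    verify=lambda claim, evidence: "unsupported",
)
print(score, searched)
```

In the toy run, only "Claim B" falls below the threshold, so it is the only claim that triggers a search.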

System Modules

Claim Extractor (Extraction & Pre-verification)

Extract atomic claims from text chunks and assign initial pre-verification labels

Model or implementation: LLM (e.g., GPT-4o-mini or similar)

Confidence Filter (Extraction & Pre-verification)

Decide whether to accept the pre-verification label or trigger external search

Model or implementation: Rule-based threshold
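
A rule-based threshold of this kind might look like the following sketch. It assumes the confidence signal is the log-probability of the pre-verification label token, converted to a probability with `exp`. The `PreverifiedClaim` type, its field names, and the `0.9` default are hypothetical; the source says the threshold is calibrated per use case rather than fixed.

```python
import math
from dataclasses import dataclass


@dataclass
class PreverifiedClaim:
    text: str
    label: str            # pre-verification label produced by the LLM
    label_logprob: float  # log-probability of the label token


def route_claims(claims, confidence_threshold=0.9):
    """Split claims into (accepted, needs_search) by label confidence."""
    accepted, needs_search = [], []
    for claim in claims:
        confidence = math.exp(claim.label_logprob)  # logprob -> probability
        if confidence >= confidence_threshold:
            accepted.append(claim)      # keep the pre-verification label
        else:
            needs_search.append(claim)  # fall back to external evidence
    return accepted, needs_search


claims = [
    PreverifiedClaim("Paris is the capital of France.", "supported", -0.01),
    PreverifiedClaim("The Louvre opened in 1793.", "supported", -0.9),
]
accepted, needs_search = route_claims(claims)
print([c.text for c in accepted])
print([c.text for c in needs_search])
```

Here the first claim has confidence of about 0.99 and is accepted, while the second has about 0.41 and is routed to external search.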

Evidence Scraper (Evidence Gathering)

Fetch full content of web pages for claims needing verification

Model or implementation: Jina Reader API / Web Scraper

Evidence Retriever (Evidence Gathering)

Select most relevant chunks from the scraped full-page documents

Model or implementation: BM25
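
To illustrate how a claim could be matched against passages cut from a scraped page, here is a small self-contained Okapi BM25 ranker. It is a generic textbook formulation, not the paper's retriever: the tokenizer, the `k1` and `b` defaults, and the example passages are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; a deliberately simple tokenizer.
    return re.findall(r"\w+", text.lower())


def bm25_rank(query, passages, k1=1.5, b=0.75, top_k=2):
    """Rank passages against a query with Okapi BM25; return top_k indices."""
    if not passages:
        return []
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long passages need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


passages = [
    "The Louvre museum opened in 1793 in Paris.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
]
print(bm25_rank("Louvre opened 1793", passages, top_k=1))
```

Only the first passage shares any term with the query, so it is ranked first.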

Claim Verifier

Verify claim against retrieved evidence chunks

Model or implementation: LLM

Novel Architectural Elements

Integrated Extraction-Preverification Module: Performing extraction and confidence-based verification in a single inference pass
Dynamic Chunking Window: Configurable sliding window (chunk stride) for extraction instead of fixed sentence-level extraction
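
A rough sketch of chunk-level grouping follows. It assumes `chunk_stride` counts sentences per chunk and that chunks do not overlap; the regex sentence splitter is a naive placeholder, and the function names are illustrative rather than taken from the released code.

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_response(text, chunk_stride):
    """Group consecutive sentences into chunks of `chunk_stride` sentences."""
    if chunk_stride < 1:
        raise ValueError("chunk_stride must be at least 1")
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + chunk_stride])
        for i in range(0, len(sentences), chunk_stride)
    ]


response = "Paris is in France. It hosts the Louvre. The Seine runs through it."
for chunk in chunk_response(response, chunk_stride=2):
    print(chunk)
```

With `chunk_stride=1` this reduces to sentence-level processing, and with a stride at least as large as the sentence count the whole response becomes a single chunk, so the number of extraction calls falls as the stride grows.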

Modeling

Base Model: GPT-4o / GPT-4o-mini (used as the backbone for extraction and verification agents in experiments)

Training Method: Prompt Engineering / In-context Learning

Adaptation: None (uses off-the-shelf LLMs)

Trainable Parameters: None

Key Hyperparameters:

chunk_stride: Configurable (1 to full length)
confidence_threshold: Calibrated based on logprobs (exact value not fixed, tuned per use case)
scraped_doc_length: ~7054 words (average)
snippet_length: ~24 words (baseline average)

Compute: Significantly lower token cost and time than SAFE/FActScore (exact reduction depends on chunk size)

Comparison to Prior Work

vs. SAFE: FaStfact uses dynamic chunking + pre-verification (faster) and full-page scraping (stronger evidence) vs. SAFE's sentence-level + snippet-based approach
vs. FActScore: FaStfact targets open-domain (web search) vs. FActScore's closed-domain (Wikipedia) focus
vs. VeriScore: FaStfact integrates extraction and pre-verification in one pass; VeriScore separates them. FaStfact fetches full pages; VeriScore uses snippets.

Limitations

Relies on the availability and speed of external APIs (Serper, Jina Reader) for evidence collection
Confidence calibration for pre-verification may require tuning for different base LLMs
Scraping full web pages introduces latency that must be offset by the efficiency gains in extraction
Performance depends heavily on the backbone LLM's instruction-following capability

Reproducibility

Code: https://github.com/Yingjia-Wan/FastFact

Code, benchmark data (FaStfact-Bench), and annotation interface are publicly available at https://github.com/Yingjia-Wan/FastFact. Uses Jina Reader API for scraping and Serper API for search.

📊 Experiments & Results

Evaluation Setup

Factuality evaluation of long-form generations using a newly annotated benchmark

Benchmarks:

FaStfact-Bench (Long-form QA Factuality Annotation) [New]

Metrics:

Processing Time (seconds)
Token Cost
Alignment with Human Judgment (Correlation / Error Rate)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FaStfact-Bench	Gap from Ground Truth Score	See Table 1 in paper (implied)	See Table 1 in paper (implied)	Smaller gap (improved alignment)
FaStfact-Bench	Processing Time	High	Low	Superior efficiency
LongFact (random sample)	Claim Extraction Failure Rate	68%	Lower (implied by design)	Improvement

Main Takeaways

FaStfact is faster and cheaper than SAFE and FActScore due to dynamic chunking and pre-verification skipping unnecessary searches.
The use of full-page evidence significantly reduces 'insufficient evidence' errors compared to snippet-based verification.
Chunk-level extraction reduces redundant and unverifiable claims by providing the model with broader context during decomposition.
Confidence-based pre-verification effectively filters out easy claims, reserving expensive web search for hard/uncertain claims.

📚 Prerequisite Knowledge

Prerequisites

Understanding of the decompose-then-verify framework for factuality
Familiarity with RAG (Retrieval-Augmented Generation) concepts like chunking and retrieval
Basic knowledge of LLM log-probabilities for confidence estimation

Key Terms

Decompose-then-verify: A framework where long text is broken into atomic claims, each is verified individually, and scores are aggregated

Atomic claim: A single, indivisible statement of fact extracted from a longer text

Chunking: Breaking text into segments (chunks) of a specific size (e.g., number of sentences) for processing

Pre-verification: Using the LLM's internal knowledge to verify claims immediately after extraction, skipping external search for high-confidence cases

BM25: A probabilistic ranking function for information retrieval that scores documents against a query using term frequency, inverse document frequency, and document-length normalization

Logprobs: Log-probabilities of tokens generated by an LLM, used here to measure the model's confidence in its pre-verification label

Snippets: Short text previews (20-40 words) returned by search engines like Google; prior work uses them as evidence, though they are often too short to verify a claim