FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

📝 Paper Summary

Evaluation methodology, Factuality and hallucination

FActScore evaluates long-form text factuality by breaking generations into atomic facts and verifying them against a knowledge source, using either human annotation or a retrieval-augmented automated estimator.

Core Problem

Evaluating factual precision in long-form generation is difficult because texts contain a mixture of supported and unsupported information, making binary labels inadequate, and human verification is costly.

Why it matters:

Current binary evaluations ignore partial correctness (e.g., a sentence with 3 true facts and 1 false fact is just labeled 'false')
Even single sentences often bundle multiple pieces of information (4.4 per sentence on average for ChatGPT), and 40% of ChatGPT sentences contain a mix of supported and unsupported facts
Existing human evaluation is prohibitively expensive ($26K for 6,500 generations), preventing scalable assessment of new models

Concrete Example: A model generates a bio: 'Michael Jordan played for the Bulls and the Mets.' A binary metric marks this as False because of the baseball error, ignoring the correct basketball fact. FActScore decomposes this into 'played for Bulls' (True) and 'played for Mets' (False), giving a more granular score.
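
The arithmetic behind this example can be sketched in a few lines. The fact strings and labels below are hand-written for illustration (they are not output from the FActScore tooling), and `atomic_precision` is a hypothetical helper name.

```python
# Hand-labeled atomic facts for the example sentence (illustrative only).
facts = [
    ("Michael Jordan played for the Bulls.", True),
    ("Michael Jordan played for the Mets.", False),
]


def atomic_precision(labeled_facts):
    """Fraction of atomic facts marked as supported."""
    if not labeled_facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for _, is_supported in labeled_facts if is_supported)
    return supported / len(labeled_facts)


# A binary sentence-level label is True only if every fact is supported.
binary_label = all(is_supported for _, is_supported in facts)
score = atomic_precision(facts)

print(binary_label)  # False
print(score)         # 0.5
```

The binary label discards the correct basketball fact, while the atomic score credits half of the sentence.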

Key Novelty

Atomic-level Factual Precision Scoring (FActScore)

Decomposes long-form text into 'atomic facts' (short statements conveying one piece of information) rather than evaluating at the sentence or document level
Defines truthfulness as 'supported by a specific knowledge source' (e.g., Wikipedia) rather than global truth, resolving ambiguity
Proposes an automated estimator using a retrieve-then-verify pipeline to approximate human judgment without manual effort

Architecture

Illustration of the FActScore concept compared to binary labeling. It shows two generated biographies about 'Ed Yost'.

Evaluation Highlights

Automatic estimator achieves <2% error rate compared to human ground truth when estimating FActScore for various LMs
ChatGPT achieves only 58% FActScore on biography generation, significantly lower than expected for a state-of-the-art model
PerplexityAI (search-augmented) scores 71.5%, dropping to 16% sentence-level accuracy for rare entities, showing search augmentation is not a silver bullet

Breakthrough Assessment

9/10

Establishes a rigorous standard for fine-grained factuality evaluation. The atomic decomposition approach and the release of an automated metric with <2% error rate are significant practical contributions.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of factual precision in long-form text generation against a knowledge source

Inputs: A generation y produced by model M for prompt x, and a knowledge source C (e.g., Wikipedia)

Outputs: FActScore: The percentage of atomic facts in y that are supported by C
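
Written out, the definition above amounts to the following. The symbols A_y (the set of atomic facts in y), f (the per-generation score), and X (the prompt set) are notation chosen here for illustration. Conditioning on the model responding reflects that an abstention (see Key Terms) yields no facts to score.

```latex
f(y) = \frac{1}{|A_y|} \sum_{a \in A_y} \mathbb{I}\left[\, a \text{ is supported by } C \,\right],
\qquad
\mathrm{FActScore}(M) = \mathbb{E}_{x \in X}\left[\, f(M_x) \mid M_x \text{ responds} \,\right]
```

Here M_x denotes the generation y that M produces for prompt x, and the score is reported as a percentage.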

Pipeline Flow

Atomic Fact Generation (Model generates facts from text)
Retrieval (Find evidence in Knowledge Source)
Verification (Determine if fact is supported)
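
The three stages above can be sketched as a single function that takes each stage as a callable. This is a minimal sketch, not the released implementation: the function names are hypothetical, and the toy decomposer, retriever, and verifier below are crude stand-ins for the LM-based components described in the next section.

```python
from typing import Callable, List


def factscore_for_generation(
    generation: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
) -> float:
    """Run decompose -> retrieve -> verify and return the supported fraction."""
    facts = decompose(generation)
    if not facts:
        raise ValueError("no atomic facts to score")
    supported = 0
    for fact in facts:
        passages = retrieve(fact)
        if verify(fact, passages):
            supported += 1
    return supported / len(facts)


# Toy stand-ins so the sketch runs end to end (NOT the paper's components).
KNOWLEDGE = ["Michael Jordan played for the Chicago Bulls."]


def toy_decompose(text: str) -> List[str]:
    # Assumes one atomic fact per line; a real system would prompt an LM.
    return [line.strip() for line in text.splitlines() if line.strip()]


def toy_retrieve(fact: str) -> List[str]:
    # Returns the whole toy corpus; a real system would rank passages.
    return KNOWLEDGE


def toy_verify(fact: str, passages: List[str]) -> bool:
    # Crude check: does the fact's last word appear in any passage?
    key = fact.rstrip(".").split()[-1].lower()
    return any(key in p.lower() for p in passages)


generation = (
    "Michael Jordan played for the Bulls.\n"
    "Michael Jordan played for the Mets."
)
result = factscore_for_generation(
    generation, toy_decompose, toy_retrieve, toy_verify
)
print(result)  # 0.5
```

Passing the stages in as callables mirrors the decoupling of decomposition from verification, so any one stage can be swapped or inspected in isolation.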

System Modules

Atomic Fact Generator

Break long-form generation into atomic statements

Model or implementation: InstructGPT (text-davinci-003) or GPT-3.5-turbo

Retriever (Verification)

Retrieve relevant context for each atomic fact

Model or implementation: GTR-Base (Generalizable T5-based Retriever)

Verifier (Verification)

Determine if the atomic fact is supported by the retrieved passages

Model or implementation: Fine-tuned LLaMA-65B or specialized NLI model

Novel Architectural Elements

Definition of 'atomic fact' as the evaluation unit to handle mixed-truth sentences
Pipeline decoupling decomposition from verification to allow fine-grained error analysis
Scoring definition conditioned on a specific knowledge source (C) rather than general world knowledge

Modeling

Base Model: Evaluator uses various backbones (e.g., LLaMA-65B, OPT-66B, GPT-3) for the verification step

Training Method: Fine-tuning for the verifier component (in automated estimator)

Adaptation: Full fine-tuning on NLI/fact-checking datasets

Trainable Parameters: Varies by verifier model (e.g., 65B for LLaMA)

Training Data:

Human-annotated biography data generated during the study (4,763 atomic facts from 183 entities)
Existing NLI datasets for initialization

Key Hyperparameters:

retrieval_k: 5 (number of passages retrieved per fact)
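
To illustrate what retrieving the top k passages per fact means, here is a minimal sketch that ranks passages by bag-of-words cosine similarity. This scoring is a deliberately simple substitute for a dense retriever such as GTR, and the passages and the `top_k_passages` helper are made up for the example.

```python
import math
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (toy stand-in for a dense encoder)."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_passages(fact: str, passages: List[str], k: int = 5) -> List[str]:
    """Return the k passages most similar to the fact, best first."""
    query = bow(fact)
    ranked = sorted(passages, key=lambda p: cosine(query, bow(p)), reverse=True)
    return ranked[:k]


# Illustrative passages, not taken from the actual knowledge source.
passages = [
    "The Chicago Bulls are a basketball team.",
    "Ed Yost was an American inventor of the modern hot-air balloon.",
    "Hot-air balloons rise because heated air is less dense.",
]
top = top_k_passages("Ed Yost invented the modern hot-air balloon.", passages, k=2)
print(top[0])
```

A larger k gives the verifier more evidence per fact at the cost of a longer context for each verification call.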

Compute: Not reported in the paper

Comparison to Prior Work

vs. FactCheck: Uses atomic facts instead of sentences, capturing partial correctness
vs. Attribution Evaluation: Verifies the content against a trusted corpus (Wikipedia) rather than just checking provided citations
vs. SelfCheckGPT: Uses an explicit external knowledge source (Wikipedia) rather than internal model consistency (a comparison point for the estimator, not a direct baseline in the paper)
vs. ROUGE/BLEU: Measures factual correctness, not n-gram overlap with a reference summary

Limitations

Assumes information in the knowledge source (Wikipedia) is non-conflicting and undebatable
Measures precision only, ignoring recall (information quantity/coverage)
Automated estimator relies on retrieval quality; poor retrieval leads to false negatives
Computational cost of the automated metric is high due to multiple LM calls (decomposition + verification)

Reproducibility

Code: https://github.com/shmsw25/FActScore

Code and data are publicly available in this repository, including the annotated dataset of roughly 4.7k atomic facts. The automated metric is installable via pip.

📊 Experiments & Results

Evaluation Setup

Generation of biographies for 183 people entities sampled from Wikidata (diverse nationalities/rarities). Knowledge source: English Wikipedia (April 2023).

Benchmarks:

Human Annotation Study (Manual verification of generated biographies) [New]
Automated Metric Validation (Correlation of automated score with human labels) [New]

Metrics:

FActScore (percentage of supported atomic facts)
Error Rate (difference between estimated and human FActScore)
Statistical methodology: Not explicitly reported in the paper
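
As a small worked example of the Error Rate metric, the sketch below takes it to be the absolute gap, in percentage points, between the estimated and the human-labeled FActScore. Treating the difference as absolute is an assumption of this sketch, and the estimator value is hypothetical.

```python
def error_rate(estimated: float, human: float) -> float:
    """Absolute gap between estimated and human FActScore, in points."""
    return abs(estimated - human)


human_score = 58.0       # ChatGPT's human-labeled FActScore, as summarized above
estimated_score = 59.5   # hypothetical estimator output, for illustration only

gap = error_rate(estimated_score, human_score)
print(gap)  # 1.5
```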

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Human evaluation of commercial LMs reveals significant hallucination rates, even in search-augmented models.
Human Annotation	FActScore	42.5	71.5	+29.0
Human Annotation	Facts/sent	2.5	4.4	+1.9
Automated estimator performance shows high correlation with human judgment, allowing scalable evaluation.
Automated Metric	Error Rate	0	1.8	+1.8
Analysis of entity rarity shows a steep drop in performance for less famous entities.
Human Annotation	FActScore	80	16	-64

Experiment Figures

FActScore performance across entity frequency (rarity) and position in text.

Main Takeaways

Commercial LMs (ChatGPT, InstructGPT) are riddled with factual errors (42-58% FActScore), and search augmentation (PerplexityAI) only improves this to 71.5%, which is still far from perfect.
Atomic facts are necessary: 40% of ChatGPT sentences contain a mix of supported and unsupported facts, rendering sentence-level binary evaluation inaccurate.
Factual precision correlates strongly with entity popularity; even search-augmented models struggle significantly with rare entities.
Error rates increase later in the generation, suggesting error propagation or loss of context over longer texts.
PerplexityAI's citations do not guarantee accuracy; ~37% of unsupported sentences still had citations.
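
The mixed-sentence statistic in the second takeaway can be computed as follows. This is a minimal sketch on made-up labels: the sentence ids, their fact labels, and the `is_mixed` helper are all illustrative and not drawn from the released annotations.

```python
from typing import Dict, List

# Sentence id -> labels of its atomic facts (True = supported). Made-up data.
labels_by_sentence: Dict[str, List[bool]] = {
    "s1": [True, True, True],
    "s2": [True, False, True, True],
    "s3": [False, False],
    "s4": [True, False],
    "s5": [True],
}


def is_mixed(labels: List[bool]) -> bool:
    """True if the sentence has at least one supported and one unsupported fact."""
    return any(labels) and not all(labels)


mixed = [sid for sid, labels in labels_by_sentence.items() if is_mixed(labels)]
mixed_fraction = len(mixed) / len(labels_by_sentence)

print(mixed)           # ['s2', 's4']
print(mixed_fraction)  # 0.4
```

A sentence-level binary label would mark s2 and s4 as simply wrong, even though most of the facts in s2 are supported.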

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Familiarity with retrieval-augmented generation (RAG)
Basic knowledge of Natural Language Inference (NLI) or Fact Verification

Key Terms

atomic fact: A short sentence conveying exactly one piece of information, used as the fundamental unit of evaluation

InstructGPT: An OpenAI model (text-davinci-003) trained to follow instructions

PerplexityAI: A commercial conversational search engine that generates answers with citations based on live web search results

FActScore: Factual precision in Atomicity Score—metric representing the percentage of atomic facts supported by a knowledge source

NLI: Natural Language Inference—determining if a hypothesis is true (entailed), false (contradicted), or neutral given a premise

LMsubj: The Language Model acting as the 'Subject' being evaluated (e.g., ChatGPT, Vicuna)

Recall: The fraction of relevant instances that were retrieved; FActScore explicitly measures precision (correctness), not recall (completeness)

abstain: When a model refuses to answer a prompt (e.g., 'I don't know'), which avoids generating false information
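
One plausible way to account for abstentions when aggregating is to average only over prompts the model actually answered and to report the abstention rate separately. The sketch below assumes abstentions are already marked as `None`; how they are detected is left open, and the helper name is hypothetical.

```python
from typing import List, Optional


def mean_over_responses(per_generation_scores: List[Optional[float]]) -> float:
    """Average per-generation scores, skipping abstentions (None)."""
    responded = [s for s in per_generation_scores if s is not None]
    if not responded:
        raise ValueError("model abstained on every prompt")
    return sum(responded) / len(responded)


# Made-up per-generation scores; None marks a prompt the model declined.
scores: List[Optional[float]] = [0.5, None, 1.0, 0.0]

model_score = mean_over_responses(scores)
abstention_rate = scores.count(None) / len(scores)

print(model_score)      # 0.5
print(abstention_rate)  # 0.25
```

Reporting both numbers keeps a model that abstains often from looking more precise than it is useful.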