Generating Benchmarks for Factuality Evaluation of Language Models

📝 Paper Summary

Factuality evaluation
Hallucination suppression

FACTOR automatically transforms factual corpora into multiple-choice benchmarks where LMs must distinguish true completions from similar but non-factual perturbations, measuring propensity for factual generation.

Core Problem

Existing factuality evaluations rely on sampling from the LM itself (over-representing common facts) or perplexity (a noisy proxy affected by style), failing to measure accuracy on a controlled set of domain-specific facts.

Why it matters:

LMs frequently generate factually incorrect information, hindering deployment in sensitive domains
Current sampling-based metrics bias evaluation toward high-likelihood facts, ignoring the 'long-tail' of rare facts
Perplexity is influenced by linguistic phenomena other than factuality and does not always correlate with factual correctness

Concrete Example: Given the prefix 'Steve Jobs left Apple in...', a model may rank the correct completion '1985' highly yet still assign high likelihood to plausible errors such as '1988' or 'forced out of NeXT'. FACTOR tests whether the true completion (1985) receives a higher likelihood than these specific, generated contradictions.

Key Novelty

Factual Assessment via Corpus TransfORmation (FACTOR)

Automatically transforms a text corpus into a benchmark by using a helper LLM (InstructGPT) to generate non-factual, fluent, and similar contradictions for specific sentences
Evaluates models based on their ability to assign higher likelihood to the original true sentence compared to the generated non-factual alternatives
Categorizes errors into specific types (Entity, Predicate, Circumstance, Coreference, Link) to ensure diversity in the non-factual distractors

Architecture

Conceptual overview of the FACTOR evaluation task

Evaluation Highlights

Larger models perform better but still struggle: OPT-66B achieves only 68.1% on News-FACTOR and 55.9% on Expert-FACTOR
Retrieval augmentation (IC-RALM) consistently improves FACTOR scores across all model sizes (e.g., GPT-Neo-2.7B improves from ~46% to ~50% on Wiki-FACTOR)
FACTOR rankings can diverge from perplexity rankings: OPT-66B has higher (worse) perplexity than GPT-J-6B (7.6 vs 7.4) but higher Wiki-FACTOR accuracy (57.7% vs 53.5%, a 4.2-point gap)

Breakthrough Assessment

8/10

A novel, scalable methodology for creating controlled factuality benchmarks without human labeling. It exposes discrepancies between perplexity and factuality, offering a more precise metric for knowledge-intensive tasks.

⚙️ Technical Details

Problem Definition

Setting: Multiple-choice task in which the LM must rank a factual completion above generated non-factual alternatives

Inputs: A prefix text t and a set of candidate completions (one true c+, three false c-)

Outputs: The completion c assigned the highest mean log-probability by the LM
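
The scoring rule above can be written out explicitly. This is a sketch under assumed notation: $|c|$ for the number of tokens in completion $c$, $c_i$ for its $i$-th token, and $p_\theta$ for the evaluated LM are symbols introduced here for illustration, not taken from the summary.

```latex
\hat{c} \;=\; \arg\max_{c \,\in\, \{c^{+},\, c^{-}_{1},\, c^{-}_{2},\, c^{-}_{3}\}}
\; \frac{1}{|c|} \sum_{i=1}^{|c|} \log p_{\theta}\!\left(c_i \mid t,\, c_{<i}\right)
```

An example counts as correct when $\hat{c} = c^{+}$. Dividing by $|c|$ keeps a completion from being penalized merely for containing more tokens.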

Pipeline Flow

Corpus Processing: Select prefix and factual completion
Generation: InstructGPT generates multiple contradictions per error type
Filtering: NLI checks for contradiction; LM checks for fluency
Selection: Choose top 3 diverse non-factual completions
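
The filtering and selection steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` type, the `contradicts` and `is_fluent` callables (stand-ins for the NLI and LM filters), and the selection heuristic (closest edit distance first, one candidate per error type before repeats) are all assumptions made for this example.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    text: str
    error_type: str  # hypothetical label, e.g. "entity", "predicate"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_distractors(
    true_completion: str,
    candidates: List[Candidate],
    contradicts: Callable[[str, str], bool],  # stand-in for the NLI filter
    is_fluent: Callable[[str], bool],         # stand-in for the LM fluency filter
    k: int = 3,
) -> List[Candidate]:
    # Filtering: keep only fluent candidates that contradict the true completion.
    kept = [
        c for c in candidates
        if c.text != true_completion
        and contradicts(true_completion, c.text)
        and is_fluent(c.text)
    ]
    # Assumed heuristic: prefer candidates most similar to the true completion.
    kept.sort(key=lambda c: edit_distance(true_completion, c.text))

    chosen: List[Candidate] = []
    seen_types = set()
    # First pass: at most one candidate per error type, for diversity.
    for c in kept:
        if len(chosen) == k:
            break
        if c.error_type not in seen_types:
            chosen.append(c)
            seen_types.add(c.error_type)
    # Second pass: fill any remaining slots with the closest leftovers.
    for c in kept:
        if len(chosen) == k:
            break
        if c not in chosen:
            chosen.append(c)
    return chosen


true_sentence = "Steve Jobs left Apple in 1985."
pool = [
    Candidate("Steve Jobs left Apple in 1988.", "circumstance"),
    Candidate("Steve Jobs left Apple in 1986.", "circumstance"),
    Candidate("Steve Wozniak left Apple in 1985.", "entity"),
    Candidate("Steve Jobs joined Apple in 1985.", "predicate"),
    Candidate("Steve Jobs departed Apple in 1985.", "predicate"),
    Candidate("Jobs Steve in left 1985 Apple.", "entity"),
]
# Toy filters for the demo only; a real pipeline would call actual models here.
picked = select_distractors(
    true_sentence,
    pool,
    contradicts=lambda premise, hypothesis: "departed" not in hypothesis,
    is_fluent=lambda text: not text.startswith("Jobs Steve"),
)
```

In this toy run the paraphrase ("departed") and the scrambled sentence are filtered out, and the three selected distractors cover three different error types.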

System Modules

Contradiction Generator

Generate non-factual variations of the true sentence

Model or implementation: InstructGPT (text-davinci-003)

Contradiction Filter (NLI) (Filtering)

Verify candidates actually contradict the premise

Model or implementation: DeBERTa-large-mnli

Fluency Filter (LM) (Filtering)

Ensure candidates are grammatical and natural

Model or implementation: GPT2-Small

Novel Architectural Elements

Automated benchmark generation pipeline that transforms any text corpus into a discriminative factuality test

Modeling

Base Model: Evaluated models: OPT (125M-66B), GPT-Neo (125M-20B), GPT-2 (110M-1.5B)

Training Method: Zero-shot evaluation on generated benchmark

Adaptation: None (evaluation only)

Trainable Parameters: None (evaluation only)

Training Data: None (evaluation only). Benchmark sources:

Wiki-FACTOR: 2994 examples from The Pile (Wikipedia val split)
News-FACTOR: 1036 examples from Reuters (RefinedWeb)
Expert-FACTOR: 236 examples from ExpertQA

Compute: Not reported in the paper

Comparison to Prior Work

vs. Perplexity: FACTOR focuses specifically on factual correctness rather than general prediction, showing divergent rankings
vs. FActScore/Lee et al.: FACTOR evaluates a controlled set of corpus facts (including rare ones) rather than sampling from the model (which biases toward common facts)
vs. TRUE (Honovich et al.) [not cited in paper]: FACTOR evaluates open-ended generation propensity rather than factual consistency of summarization/NLI

Limitations

Benchmark generation relies on proprietary models (InstructGPT) which may change or become unavailable
Automated filtering (NLI/Fluency) is not perfect; manual validation showed ~97% accuracy, not 100%
Current error typology and prompts are designed for Wikipedia/News; may need adaptation for highly specialized domains
Multiple-choice evaluation is a proxy for open-ended generation (though shown to correlate)

Reproducibility

Code: https://github.com/AI21Labs/factor

Data and code are publicly available. Pipeline relies on OpenAI API (text-davinci-003) which is deprecated/legacy, potentially affecting exact reproduction of benchmark generation.

📊 Experiments & Results

Evaluation Setup

Multiple-choice classification: select the true completion among 4 options (1 true, 3 generated false)

Benchmarks:

Wiki-FACTOR (Encyclopedic knowledge evaluation) [New]
News-FACTOR (Current events knowledge evaluation) [New]
Expert-FACTOR (Domain-specific expert knowledge evaluation) [New]

Metrics:

FACTOR Accuracy (percentage of examples where true completion has highest likelihood)
Perplexity (token-level)
Statistical methodology: Not explicitly reported in the paper
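
As a rough illustration of the FACTOR accuracy metric, the sketch below scores each candidate by its mean token log-probability and counts an example as correct only when the true completion scores strictly highest. The input layout (true completion at index 0) and the handling of ties as incorrect are assumptions of this example, and the log-probabilities are made-up numbers rather than model outputs.

```python
from typing import List, Sequence


def mean_log_prob(token_log_probs: Sequence[float]) -> float:
    """Average per-token log-probability of one completion."""
    return sum(token_log_probs) / len(token_log_probs)


def factor_accuracy(examples: List[List[Sequence[float]]]) -> float:
    """Percentage of examples where the true completion scores strictly highest.

    Each example is a list of candidates; each candidate is its list of
    per-token log-probs. By assumption, index 0 is the true completion.
    """
    correct = 0
    for candidates in examples:
        scores = [mean_log_prob(tokens) for tokens in candidates]
        if all(scores[0] > other for other in scores[1:]):
            correct += 1
    return 100.0 * correct / len(examples)


# Illustrative, made-up log-probs: the first example is correct, the second is not.
examples = [
    [[-0.5, -1.0], [-2.0, -1.0], [-3.0], [-0.9, -0.9, -0.9]],
    [[-2.0, -2.0], [-1.0, -1.5], [-2.5], [-3.0, -3.0]],
]
accuracy = factor_accuracy(examples)
```

With four candidates per example, uniform guessing would be expected to give 25% under this metric.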

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Wiki-FACTOR	FACTOR Accuracy	53.5	57.7	+4.2
Wiki-FACTOR	FACTOR Accuracy	58.0	58.0	0.0
News-FACTOR	FACTOR Accuracy	68.1	68.1	0.0
Wiki-FACTOR	FACTOR Accuracy	46.3	50.0	+3.7
Wiki-FACTOR (Validation)	True Claims % (Open Generation)	24.8	38.8	+14.0

Experiment Figures

Scatter plot of Wiki-FACTOR scores vs LM perplexity for GPT-Neo and OPT families

Bar charts comparing standard LMs vs IC-RALM (retrieval augmented) variants on Wiki-FACTOR

Main Takeaways

FACTOR accuracy generally scales with model size, but different model families (OPT vs GPT-Neo) show strengths in different domains (News vs Wiki)
Retrieval augmentation (IC-RALM) improves factuality consistently across model sizes, validating FACTOR as a metric for RAG systems
Perplexity and FACTOR accuracy can disagree on rankings; when they do, FACTOR better predicts human-evaluated factuality in open-ended generation
Even large models (66B) struggle with these benchmarks (max ~68% accuracy), indicating significant room for improvement in factual precision

📚 Prerequisite Knowledge

Prerequisites

Language Modeling (likelihood/perplexity)
Retrieval-Augmented Generation (RAG)
NLI (Natural Language Inference) for contradiction detection

Key Terms

FACTOR: Factual Assessment via Corpus TransfORmation—the proposed framework for generating factuality benchmarks

IC-RALM: In-Context Retrieval-Augmented Language Models—augmenting LMs by prepending retrieved documents to the context without training
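
The prepending idea can be sketched as follows. This is a toy illustration only: the word-overlap `retrieve` function and the `augment_prefix` helper are hypothetical names invented for this example, not the retriever or API used in the paper.

```python
from typing import List


def retrieve(query: str, corpus: List[str]) -> str:
    """Toy retriever: return the document sharing the most lowercase words with the query."""
    query_words = set(query.lower().split())
    return max(corpus, key=lambda doc: len(query_words & set(doc.lower().split())))


def augment_prefix(prefix: str, corpus: List[str]) -> str:
    """Prepend the retrieved document to the prefix; the LM itself is left unchanged."""
    return retrieve(prefix, corpus) + "\n\n" + prefix


corpus = [
    "Apple was founded in 1976 by Steve Jobs and Steve Wozniak.",
    "The Eiffel Tower is in Paris.",
]
augmented = augment_prefix("Steve Jobs left Apple in", corpus)
```

The candidate completions would then be scored against `augmented` instead of the bare prefix, with no change to the model weights.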

Perplexity: A measurement of how well a probability model predicts a sample; often used as a proxy for LM quality but shown here to imperfectly correlate with factuality

Edit-distance: A measure of dissimilarity between two strings (e.g., Levenshtein distance), used here to ensure false completions are similar to true ones

NLI: Natural Language Inference, the task of deciding whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise; used here to filter generated contradictions

The Pile: A large-scale, diverse dataset for language modeling; the Wikipedia validation split is used for Wiki-FACTOR

InstructGPT: A model fine-tuned with human feedback (RLHF); used here as the generator for non-factual contradictions

Semantic frame error: Errors involving the main predicate or arguments (Entity, Predicate, Circumstance)

Discourse error: Errors involving relationships between sentences (Coreference, Link)