Long-form factuality in large language models

📝 Paper Summary

Factuality evaluation · Long-form generation · LLM-as-a-judge

The paper introduces LongFact, a large-scale prompt set for long-form factuality, and SAFE, an automated evaluator built on a search-augmented LLM that is both cheaper and more accurate than crowdsourced human annotators.

Core Problem

LLMs frequently hallucinate or produce factual errors in open-ended long-form responses, yet existing benchmarks focus on short answers or single factoids, making comprehensive evaluation difficult.

Why it matters:

Current benchmarks like TruthfulQA or HaluEval mostly test short-answer factoids, failing to capture the complexity of multi-paragraph responses
Human annotation for long-form text is expensive ($4.00/response), slow, and hard to scale
Existing automated metrics (BLEURT, ROUGE) rely on reference answers, which are difficult to compile for open-ended questions

Concrete Example: When asked 'What is the Eiffel Tower?', an LLM might generate a paragraph containing multiple claims. One sentence might correctly state that it is in Paris, while another incorrectly claims that it opened in the 20th century (it opened in 1889). Standard metrics that compare the whole text against a reference summary often miss such fine-grained factual errors.

Key Novelty

Search-Augmented Factuality Evaluator (SAFE)

Decomposes long-form responses into individual atomic facts using an LLM
Uses a multi-step reasoning agent to generate Google Search queries for each fact and determine support based on search results
Introduces F1@K, a metric balancing factual precision (the share of facts that are supported) with recall (the number of supported facts relative to a target count K)

Architecture

The 4-step workflow of the Search-Augmented Factuality Evaluator (SAFE)

Evaluation Highlights

SAFE agrees with crowdsourced human annotators 72.0% of the time on a set of ~16k individual facts
SAFE wins 76% of a random sample of 100 disagreement cases against crowdsourced human annotators (when ground truth is determined by researchers with full internet access)
SAFE is more than 20 times cheaper than human annotators ($0.19 vs $4.00 per response)

Breakthrough Assessment

9/10

Significantly advances automated evaluation by demonstrating LLM agents can outperform crowdsourced humans for fact-checking. The release of a massive prompt set (LongFact) and the agent code addresses a major bottleneck in factuality research.

⚙️ Technical Details

Problem Definition

Setting: Evaluating the factuality of long-form LLM responses y given a prompt x without a pre-set reference answer

Inputs: Prompt x, Model response y

Outputs: F1@K score based on the count of supported facts S(y) and not-supported facts N(y)
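
For reference, the metric can be written out in terms of the counts above. Here K is the target number of supported facts, precision is the share of rated facts that are supported, and recall is capped at 1 once the response supplies K supported facts.

```latex
\mathrm{Prec}(y) = \frac{S(y)}{S(y) + N(y)}, \qquad
R_K(y) = \min\!\left(\frac{S(y)}{K},\, 1\right)

F_1@K(y) =
\begin{cases}
\dfrac{2\,\mathrm{Prec}(y)\,R_K(y)}{\mathrm{Prec}(y) + R_K(y)} & \text{if } S(y) > 0,\\[6pt]
0 & \text{if } S(y) = 0.
\end{cases}
```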

Pipeline Flow

Response Decomposition (Split into sentences -> Split into atomic facts)
Fact Revision (Make facts self-contained by resolving coreferences)
Relevance Check (Filter out irrelevant statements)
Fact Verification (Iterative Search + Reasoning)
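
The four steps above can be sketched as a single loop over facts. This is a minimal illustration, not the released implementation: the function name `rate_response`, the `SafeCounts` container, and the four callables are all hypothetical stand-ins for the LLM-backed modules, which would be injected by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SafeCounts:
    """Per-response tallies; `supported` and `not_supported` feed F1@K."""
    supported: int = 0
    not_supported: int = 0
    irrelevant: int = 0


def rate_response(
    prompt: str,
    response: str,
    split_facts: Callable[[str], List[str]],       # step 1 (hypothetical LLM call)
    revise_fact: Callable[[str, str], str],        # step 2 (hypothetical LLM call)
    is_relevant: Callable[[str, str, str], bool],  # step 3 (hypothetical LLM call)
    is_supported: Callable[[str], bool],           # step 4 (hypothetical search agent)
) -> SafeCounts:
    counts = SafeCounts()
    for raw_fact in split_facts(response):
        # Make the fact self-contained using the full response as context.
        fact = revise_fact(raw_fact, response)
        # Irrelevant facts are tallied separately and never verified.
        if not is_relevant(prompt, response, fact):
            counts.irrelevant += 1
            continue
        if is_supported(fact):
            counts.supported += 1
        else:
            counts.not_supported += 1
    return counts
```

Passing the modules in as callables keeps the control flow testable with simple stubs, without committing to any particular model or search client.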

System Modules

Fact Splitter (Decomposition)

Break down long-form responses into individual sentences and then into atomic facts

Model or implementation: gpt-3.5-turbo-0125

Fact Reviser (Decomposition)

Revise individual facts to be self-contained by replacing vague references (pronouns) with proper entities

Model or implementation: gpt-3.5-turbo-0125

Relevance Checker

Determine if a fact is relevant to answering the prompt (e.g., filtering 'I don't know')

Model or implementation: gpt-3.5-turbo-0125

Verifier Agent

Iteratively issue Google Search queries and reason about whether search results support the fact

Model or implementation: gpt-3.5-turbo-0125 connected to Serper API

Novel Architectural Elements

Iterative search-and-reason loop for fact verification: the model dynamically generates search queries based on the claim and previous search results, rather than using a static retrieval step
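
The loop described above might look like the following sketch. Everything here is an assumption made for illustration: `verify_fact`, `next_query`, `search`, and `judge` are hypothetical names, and the `max_steps=5` default is a placeholder rather than a confirmed setting. The point is only that each query is conditioned on the evidence gathered so far, and the verdict is issued after the search budget is spent.

```python
from typing import Callable, List


def verify_fact(
    fact: str,
    next_query: Callable[[str, List[str]], str],  # hypothetical: LLM proposes a query
    search: Callable[[str], str],                 # hypothetical: search API wrapper
    judge: Callable[[str, List[str]], bool],      # hypothetical: LLM gives final verdict
    max_steps: int = 5,                           # assumed budget, for illustration only
) -> bool:
    """Return True if the gathered evidence is judged to support `fact`."""
    evidence: List[str] = []
    for _ in range(max_steps):
        # The next query depends on the claim AND on previous results,
        # unlike a single static retrieval step.
        query = next_query(fact, evidence)
        evidence.append(search(query))
    return judge(fact, evidence)
```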

Modeling

Base Model: gpt-3.5-turbo-0125 (used as the evaluator agent)

Compute: Rating 16,011 individual facts cost $96.31 in total, or about $0.19 per response, using GPT-3.5-Turbo and the Serper API
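
The cost figures can be cross-checked with simple arithmetic. The snippet below uses only numbers quoted in this summary (the $4.00 human cost per response appears earlier); the variable names are illustrative.

```python
# Figures quoted in this summary.
TOTAL_COST_USD = 96.31
NUM_FACTS = 16_011
SAFE_COST_PER_RESPONSE_USD = 0.19
HUMAN_COST_PER_RESPONSE_USD = 4.00

# Average SAFE cost for a single atomic fact.
cost_per_fact = TOTAL_COST_USD / NUM_FACTS

# How many times more expensive human annotation is per response.
cost_ratio = HUMAN_COST_PER_RESPONSE_USD / SAFE_COST_PER_RESPONSE_USD

print(f"SAFE cost per fact: ${cost_per_fact:.4f}")
print(f"Human / SAFE cost ratio: {cost_ratio:.1f}x")
```

This gives roughly $0.0060 per fact and a ratio of about 21.1x, consistent with the "more than 20 times cheaper" claim.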

Comparison to Prior Work

vs. FActScore: SAFE uses dynamic Google Search instead of static Wikipedia dumps, covering a broader range of topics
vs. FacTool: SAFE introduces a multi-step reasoning process for query generation and evidence synthesis that is tailored to long-form text
vs. Human Annotation: SAFE is 20x cheaper and demonstrably more accurate on disagreement cases

Limitations

Relies on Google Search, which may miss information or lack depth in expert domains (law, medicine)
Dependent on the capability of the underlying LLM (GPT-3.5) for reasoning and instruction following
F1@K metric assumes no repeated facts; models could theoretically game this by repeating supported facts
Evaluation cost scales with the number of facts in the response
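
The repeated-facts limitation can be made concrete with a toy calculation. The sketch below is illustrative only: the facts, their labels, and K=8 are made up, and exact-string deduplication is a hypothetical mitigation that is not described as part of SAFE (real duplicates would usually need semantic matching).

```python
from typing import List, Tuple


def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K from counts of supported and not-supported facts."""
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


def count_labels(facts: List[Tuple[str, bool]], dedupe: bool) -> Tuple[int, int]:
    """Return (supported, not_supported), optionally collapsing exact repeats."""
    if dedupe:
        unique = {}
        for text, ok in facts:
            # Keep the first label seen for each normalized fact string.
            unique.setdefault(text.strip().lower(), ok)
        labels = list(unique.values())
    else:
        labels = [ok for _, ok in facts]
    supported = sum(labels)
    return supported, len(labels) - supported


# Toy response: one supported fact, one unsupported fact.
honest = [("The Eiffel Tower is in Paris.", True),
          ("The Eiffel Tower opened in 1925.", False)]
# Same response padded with ten repeats of the supported fact.
padded = honest + [honest[0]] * 10

K = 8
score_honest = f1_at_k(*count_labels(honest, dedupe=False), K)
score_padded = f1_at_k(*count_labels(padded, dedupe=False), K)
score_deduped = f1_at_k(*count_labels(padded, dedupe=True), K)
print(score_honest, score_padded, score_deduped)
```

Without deduplication the padded response scores far higher than the honest one, because the repeats raise both precision and recall; collapsing exact repeats restores the original score.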

Reproducibility

Code: https://github.com/google-deepmind/long-form-factuality

Code availability: Publicly available. The SAFE code and the LongFact prompt set are both released at https://github.com/google-deepmind/long-form-factuality. The prompt set contains 2,280 prompts. The evaluator depends on closed-source models (GPT-3.5, GPT-4) and an external search API (Serper).

📊 Experiments & Results

Evaluation Setup

Open-domain long-form question answering across 38 topics

Benchmarks:

LongFact (Long-form generation) [New]

Metrics:

F1@K (Aggregated Long-form Factuality)
Precision (Ratio of supported facts)
Recall (Ratio of supported facts provided relative to K)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of SAFE evaluator performance against human annotators shows SAFE is both more accurate and cheaper.
Min et al. (2023) dataset	Agreement Rate	100	72.0	-28.0
Min et al. (2023) dataset	Win Rate on Disagreements	19	76	+57
Min et al. (2023) dataset	Cost per response ($)	4.00	0.19	-3.81
Benchmarking LLMs on LongFact using SAFE shows larger models generally perform better.
LongFact-Objects	F1@64	40.5	66.4	+25.9
LongFact-Objects	F1@64	50.4	60.3	+9.9
LongFact-Objects	F1@64	13.2	55.3	+42.1
LongFact-Objects	Precision (Prec)	92.8	88.5	-4.3

Experiment Figures

Comparison of SAFE vs. Human Annotators on disagreement cases and cost

F1@K benchmark scores for 13 models on LongFact

Main Takeaways

Larger language models (GPT-4-Turbo, Gemini-Ultra) generally achieve better long-form factuality than smaller counterparts.
RLHF (Reinforcement Learning from Human Feedback) appears to significantly improve long-form factuality (e.g., PaLM-2-L-IT-RLHF vs PaLM-2-L-IT).
Newer models like Claude-3-Opus and Gemini-Ultra match or surpass GPT-4 (classic) performance.
LLM agents can serve as reliable factuality annotators that outperform crowdsourced humans when equipped with search tools and iterative reasoning.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Familiarity with F1 score (precision and recall)
Basic knowledge of LLM agents and tool use (search APIs)

Key Terms

LongFact: A newly proposed prompt set comprising 2,280 questions across 38 topics designed to elicit long-form, fact-heavy responses

SAFE: Search-Augmented Factuality Evaluator—an LLM agent workflow that decomposes text into facts and verifies them using Google Search

F1@K: A metric for long-form factuality that balances precision (percentage of supported facts) and recall (percentage of supported facts relative to a target number K)

FActScore: A prior metric/framework for evaluating long-form factuality by breaking text into atomic facts and verifying them against Wikipedia

atomic fact: A single, self-contained piece of information extracted from a longer sentence (e.g., 'The Eiffel Tower is in Paris')

hallucination: When a model generates content that is factually incorrect, nonsensical, or inconsistent with either its own internal knowledge or external reality

LLM agent: An LLM setup that can use tools (like Google Search) and perform multi-step reasoning to complete a task

SFT: Supervised Fine-Tuning—training a model on labeled examples

RLHF: Reinforcement Learning from Human Feedback—a training method to align models with human preferences