FELM: Benchmarking Factuality Evaluation of Large Language Models

📝 Paper Summary

Factuality Evaluation, LLM Benchmarking

FELM is a benchmark for evaluating LLM factuality across five domains using fine-grained segment-level annotations, error types, and reference links to gauge the reliability of factuality evaluators.

Core Problem

Existing factuality benchmarks focus narrowly on world knowledge or specific tasks like summarization, often using text from weaker models, which fails to capture the diverse hallucination patterns of modern LLMs.

Why it matters:

LLMs like ChatGPT are widely used for diverse tasks (math, coding, reasoning) beyond simple fact retrieval, yet they still hallucinate significantly.
Without reliable 'meta-evaluation' benchmarks (evaluating the evaluators), it is impossible to gauge progress in developing automated factuality detection systems.
Current factuality metrics often lack granularity, making it difficult for users to pinpoint exactly which part of a long response is incorrect.

Concrete Example: A prompt asking 'Is it true that new year's day 2023 falls on a Friday the 13th?' rests on a false premise (New Year's Day is January 1, so it can never be the 13th), yet it might trigger ChatGPT to agree ('Yes, it is true...'). Current evaluators might miss this 'Fooled error' or fail to identify the specific incorrect segment within a long response.
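
The premise in this prompt can be checked mechanically, which shows why agreeing with it is an error. The sketch below uses only Python's standard `datetime` module; the helper name `friday_13th_months` and the `WEEKDAYS` list are introduced here for illustration and are not part of FELM.

```python
from datetime import date

# Index matches date.weekday(): 0 is Monday, 6 is Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def friday_13th_months(year: int) -> list[int]:
    """Return the months (1-12) of `year` in which the 13th is a Friday."""
    return [m for m in range(1, 13) if date(year, m, 13).weekday() == 4]


new_years_day = date(2023, 1, 1)
weekday_name = WEEKDAYS[new_years_day.weekday()]

print(weekday_name)              # Sunday
print(friday_13th_months(2023))  # [1, 10]
```

New Year's Day 2023 is January 1 and falls on a Sunday, so the claim fails on both the day of the month and the day of the week.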

Key Novelty

Multi-Domain Segment-Level Factuality Benchmark

Broadens factuality to five domains (World Knowledge, Science/Tech, Math, Reasoning, Writing/Rec) to match LLM capabilities, rather than just Wikipedia-based knowledge.
Uses segment-level granularity (splitting responses into self-contained text spans) for precise error localization, unlike whole-response labels.
Provides rich meta-data for every error: specific error type (e.g., Knowledge error, Reasoning error), error reasoning, and URL references supporting the judgment.

Architecture

Data scheme of FELM showing the annotation process and structure.

Evaluation Highlights

The overall factual error rate of ChatGPT on FELM is 31.8% at the response level.
Human annotators achieve a high segment-level agreement rate of 90.7% on average.
Current LLMs (ChatGPT and GPT-4) struggle as factuality evaluators: they are far from satisfactory at faithfully detecting factual errors.
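
To make the relation between segment-level labels and a response-level error rate concrete, here is a minimal sketch. It assumes a response counts as erroneous when at least one of its segments is labeled incorrect; that aggregation rule, the function names, and the toy labels are assumptions for illustration, not details confirmed by this summary.

```python
def response_has_error(segment_labels: list[bool]) -> bool:
    """segment_labels[i] is True when segment i is factually correct."""
    # Assumed rule: one incorrect segment makes the whole response erroneous.
    return not all(segment_labels)


def response_error_rate(responses: list[list[bool]]) -> float:
    """Fraction of responses that contain at least one incorrect segment."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(response_has_error(r) for r in responses) / len(responses)


# Toy labels, invented for this example (not FELM data).
toy_responses = [
    [True, True],         # fully correct
    [True, False, True],  # one incorrect segment
    [True],               # fully correct
    [False],              # incorrect
]
toy_rate = response_error_rate(toy_responses)
print(toy_rate)  # 0.5
```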

Breakthrough Assessment

8/10

Significantly expands the scope of factuality evaluation beyond standard Wikipedia/summarization tasks to include reasoning and math, with high-quality expert annotation.

⚙️ Technical Details

Problem Definition

Setting: Meta-evaluation of factuality detection: Given a prompt and an LLM-generated response, the task is to identify factual errors at the segment level.

Inputs: A prompt p and an LLM response r (segmented into s_1, s_2, ... s_n).

Outputs: For each segment, a binary factuality label (Correct/Incorrect), error type, error reason, and supporting reference links.
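
The inputs and outputs above suggest a simple record layout. The following sketch is one possible way to hold a sample in memory; the class and field names are hypothetical and do not claim to match the schema of the released dataset, and the example record is invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Segment:
    text: str
    is_correct: bool                    # binary factuality label
    error_type: Optional[str] = None    # e.g. "Knowledge error"; None if correct
    error_reason: Optional[str] = None  # annotator's explanation
    references: list[str] = field(default_factory=list)  # supporting links


@dataclass
class Sample:
    prompt: str
    domain: str
    segments: list[Segment]

    @property
    def response(self) -> str:
        """Reassemble the full response r from segments s_1 ... s_n."""
        return " ".join(s.text for s in self.segments)

    def incorrect_segments(self) -> list[int]:
        """1-based indices of segments labeled incorrect."""
        return [i for i, s in enumerate(self.segments, start=1) if not s.is_correct]


# Invented toy record, for illustration only.
toy_sample = Sample(
    prompt="What is 2 + 2?",
    domain="Math",
    segments=[
        Segment("We add the two numbers.", True),
        Segment("2 + 2 equals 5.", False, "Reasoning error",
                "Incorrect arithmetic: 2 + 2 = 4."),
    ],
)
print(toy_sample.incorrect_segments())  # [2]
```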

Pipeline Flow

Prompt Collection (from varied sources)
Response Generation (ChatGPT)
Segmentation (Sentence-level or ChatGPT-assisted)
Human Annotation (Labels, Types, Reasons, References)

System Modules

Prompt Collection (Data Construction)

Gather diverse prompts across 5 domains

Model or implementation: N/A (Sourced from Quora, TruthfulQA, MMLU, GSM8K, etc.)

Response Generation (Data Construction)

Generate text to be evaluated

Model or implementation: ChatGPT (Zero-shot)

Segmentation (Data Construction)

Break responses into granular units for annotation

Model or implementation: NLTK tokenizer or ChatGPT
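
As a rough picture of the heuristic, sentence-level path, the sketch below splits a response on sentence-ending punctuation. It is a dependency-free stand-in written for this summary, not the NLTK tokenizer or the ChatGPT-assisted splitting named above, and the function name `split_into_segments` is hypothetical.

```python
import re

# Split on whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_segments(response: str) -> list[str]:
    """Naive sentence splitter used here as a stand-in for a real tokenizer."""
    parts = _SENTENCE_END.split(response.strip())
    return [p for p in parts if p]


toy_segments = split_into_segments(
    "Paris is the capital of France. It lies on the Seine. Is that all?"
)
print(toy_segments)
```

A splitter this simple breaks on abbreviations such as "e.g. " and on decimal points followed by spaces, which is one reason a trained tokenizer or a model-based splitter is preferable in practice.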

Annotation (Data Construction)

Label segments for factuality

Model or implementation: Expert Human Annotators

Novel Architectural Elements

Inclusion of non-traditional factuality domains: Math and Reasoning
Two-stage segmentation strategy combining heuristic (NLTK) and model-based (ChatGPT) splitting
Error taxonomy including 'Fooled error' for prompt-sensitivity failures

Modeling

Base Model: ChatGPT (gpt-3.5-turbo) used for response generation

Training Method: Zero-shot generation (Evaluation/Benchmarking paper)

Adaptation: None

Benchmark Data (no model training is involved):

Total samples: 817
Total segments: 3948
Domains: World Knowledge, Science/Tech, Math, Reasoning, Writing/Rec

Compute: Not reported in the paper

Comparison to Prior Work

vs. HaluEval: FELM collects errors from real scenarios (natural prompting) rather than induced errors.
vs. TruthfulQA: FELM covers broader domains like Math and Reasoning, not just world knowledge/misconceptions.
vs. Standard Factuality Metrics (e.g., for Summarization): FELM evaluates open-ended generation where the 'source' document is not provided (requires external retrieval).

Limitations

Segmentation is subjective and relies on heuristics or ChatGPT, either of which can produce imperfect segment boundaries.
Annotation is resource-intensive, limiting dataset size (817 samples) compared to purely synthetic benchmarks.
Reference links provided by annotators may change or become unavailable over time.

Reproducibility

Code: https://github.com/hkust-nlp/felm

Code and dataset are publicly available (https://github.com/hkust-nlp/felm). The dataset includes prompts, responses, segments, and annotations. The paper details the prompt sources (TruthfulQA, MMLU, etc.) and the specific segmentation prompts used.

📊 Experiments & Results

Evaluation Setup

Benchmarking LLMs as factuality evaluators on the constructed FELM dataset.

Benchmarks:

FELM (Factuality Detection) [New]

Metrics:

F1 Score
Accuracy
Precision
Recall
Statistical methodology: Not explicitly reported in the paper
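
The four metrics listed above can be computed from gold and predicted segment labels as sketched below. Treating "factual error" as the positive class is an assumption made for this example, as are the function name and the toy labels; the paper's exact evaluation script is not reproduced here.

```python
def detection_metrics(gold_is_error: list[bool],
                      pred_is_error: list[bool]) -> dict[str, float]:
    """Precision, recall, F1 and accuracy with 'error' as the positive class."""
    if len(gold_is_error) != len(pred_is_error):
        raise ValueError("gold and predicted labels must have equal length")
    if not gold_is_error:
        raise ValueError("need at least one segment")

    pairs = list(zip(gold_is_error, pred_is_error))
    tp = sum(g and p for g, p in pairs)          # errors correctly flagged
    fp = sum((not g) and p for g, p in pairs)    # correct segments flagged
    fn = sum(g and (not p) for g, p in pairs)    # errors missed
    tn = sum((not g) and (not p) for g, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy labels, invented for this example (True means "segment contains an error").
toy_gold = [True, False, True, False, False]
toy_pred = [True, True, False, False, False]
toy_metrics = detection_metrics(toy_gold, toy_pred)
print(toy_metrics)
```

With these toy labels there is one true positive, one false positive, one false negative, and two true negatives, giving precision, recall, and F1 of 0.5 and accuracy of 0.6.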

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FELM	Response-level Error Rate	Not reported in the paper	31.8%	Not reported in the paper
FELM	Inter-Annotator Agreement	Not reported in the paper	90.7%	Not reported in the paper

Experiment Figures

Distribution of prompt sources and error types across the five domains.

Main Takeaways

Factuality error detection remains a challenging task for current LLMs (ChatGPT, GPT-4), even when augmented with retrieval or Chain-of-Thought.
Retrieval mechanisms help improve factuality evaluation but are not a complete solution.
Claim-based evaluators (extracting atomic facts) are generally more effective than segment-based or response-based evaluators (qualitative finding discussed in text).
LLMs struggle specifically with 'Reasoning' and 'Math' domains compared to standard World Knowledge.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM hallucinations/factuality issues
Basics of dataset creation (annotation, inter-annotator agreement)
Zero-shot prompting

Key Terms

World Knowledge: Domain concerning specific entities like movies, countries, dates, and people.

Reasoning Error: An error type arising when a claim employs flawed reasoning, faulty logic, or incorrect mathematical calculation.

Fooled Error: An error type where the model fails to recognize falsehoods or jokes in the prompt and provides an inaccurate response.

Chain-of-Thought: Prompting technique where the model generates reasoning steps before the final answer.

Inter-annotator agreement: A statistical measure of how much two or more human annotators agree on their labels (used here to validate dataset quality).
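
One simple form of agreement is the fraction of segments on which two annotators assign the same label. The sketch below computes that raw percent agreement; whether the reported figure was computed exactly this way is an assumption, and the function name and toy labels are invented for illustration.

```python
def agreement_rate(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Fraction of segments on which two annotators give the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same segments")
    if not labels_a:
        raise ValueError("need at least one segment")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy labels, invented for this example (True means "segment is correct").
annotator_a = [True, True, False, True]
annotator_b = [True, False, False, True]
toy_agreement = agreement_rate(annotator_a, annotator_b)
print(toy_agreement)  # 0.75
```

Raw agreement does not correct for agreement expected by chance; chance-corrected statistics such as Cohen's kappa address that, at the cost of a less direct interpretation.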