FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

📝 Paper Summary

Hallucination detection, Summarization evaluation

FaithBench is a human-annotated benchmark focusing on challenging hallucinations in summaries generated by 10 diverse modern LLMs, introducing categories for benign and questionable errors.

Core Problem

Existing hallucination benchmarks rely on outdated LLMs, lack model diversity, and use detectors with low accuracy (often <80%), failing to capture the nuance of modern model errors.

Why it matters:

Current benchmarks often test easy hallucinations that automatic systems can already catch, missing the subtle errors modern models make.
Leaderboards using only one or two model families (like GPT) bias results, ignoring how different architectures (Gemini, Claude, Llama) hallucinate differently.
Binary labels (hallucinated vs. faithful) ignore the subjectivity of 'benign' hallucinations that users might actually value (e.g., added reasoning or external facts).

Concrete Example: If a passage says 'water has a smell' and the summary says 'water is odorless', the summary is factually true but unfaithful to the source. Existing benchmarks might label such a case inconsistently, whereas FaithBench assigns it a specific category from its granular taxonomy.

Key Novelty

Diverse, Difficult, and Nuanced Hallucination Benchmark

Constructs a dataset of 'challenging' samples where popular automated detectors (GPT-4o, HHEM, etc.) disagree, rather than obvious errors.
Expands binary labels to a 4-class taxonomy: Consistent, Benign (hallucinated but acceptable), Questionable (subjective), and Unwanted (harmful).
Includes summaries from 10 distinct modern LLMs across 8 families (GPT, Llama, Gemini, Mistral, Phi, Claude, Command-R, Qwen) to ensure diversity.

Evaluation Highlights

State-of-the-art hallucination detectors achieve only ~50-58% balanced accuracy on FaithBench, highlighting the difficulty of these samples.
GPT-4o produces the fewest hallucinations, followed by GPT-3.5-Turbo and Gemini-1.5-Flash.
Claude-3.5-Sonnet produces a significant number (21.31%) of 'benign' hallucinations—content not in the source but acceptable to humans.

Breakthrough Assessment

8/10

Significant contribution by focusing on 'hard' samples where current detectors fail and introducing necessary nuance (benign vs. unwanted) into evaluation. The low detector accuracy on these samples supports the benchmark's utility.

⚙️ Technical Details

Problem Definition

Setting: Task-based faithfulness evaluation (specifically summarization) where generated text must adhere to the input passage.

Inputs: A source passage P and an LLM-generated summary S.

Outputs: A label classifying the summary as Consistent, Benign, Questionable, or Unwanted (Intrinsic/Extrinsic).
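The input/output contract above can be sketched as a small data structure. This is a minimal illustration, not the schema used in the FaithBench repository: the class and field names (Label, UnwantedKind, AnnotatedSample, unwanted_kind) are hypothetical, and the assumption that the Intrinsic/Extrinsic subtype applies only to Unwanted labels is read off the Outputs line.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Label(Enum):
    """The four sample-level classes named in the taxonomy."""
    CONSISTENT = "consistent"
    BENIGN = "benign"
    QUESTIONABLE = "questionable"
    UNWANTED = "unwanted"


class UnwantedKind(Enum):
    """Subtype of an Unwanted hallucination."""
    INTRINSIC = "intrinsic"  # contradicts the source passage
    EXTRINSIC = "extrinsic"  # not supported by the source passage


@dataclass(frozen=True)
class AnnotatedSample:
    """One (passage P, summary S) pair with its label (hypothetical schema)."""
    passage: str
    summary: str
    label: Label
    unwanted_kind: Optional[UnwantedKind] = None

    def __post_init__(self) -> None:
        # Assumption: the subtype is only meaningful for Unwanted samples.
        if self.unwanted_kind is not None and self.label is not Label.UNWANTED:
            raise ValueError("unwanted_kind requires label == Label.UNWANTED")


# Placeholder strings stand in for a real passage and summary.
sample = AnnotatedSample("P", "S", Label.UNWANTED, UnwantedKind.INTRINSIC)
```

Keeping the subtype as a separate optional field leaves the four top-level classes directly comparable across samples.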

Pipeline Flow

Data Sourcing (Vectara Leaderboard)
Filtering by LLM Diversity (10 models)
Filtering for Difficulty (Detector Disagreement)
Human Annotation

System Modules

Data Source (Data Preparation)

Provide initial pool of passage-summary pairs

Model or implementation: Vectara Hallucination Leaderboard dataset

Model Selector (Data Preparation)

Select diverse modern LLMs

Model or implementation: Rules: Smallest version of latest generation for 8 families

Difficulty Filter

Identify 'challenging' samples where detectors disagree

Model or implementation: Ensemble of detectors (True-NLI, TrueTeacher, HHEM-2.1, GPT-as-judge)

Human Annotation

Assign ground truth labels and span-level justifications

Model or implementation: Human Experts (11 annotators)

Novel Architectural Elements

Pipeline focused specifically on 'challenging' samples defined by detector disagreement rather than random sampling
Four-tier taxonomy (Consistent, Benign, Questionable, Unwanted) to capture gray areas in hallucination
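The detector-disagreement filter described above can be sketched as follows. This is an illustration under stated assumptions: the summary does not give the paper's exact disagreement criterion, so the sketch treats a sample as 'challenging' whenever the detectors' binary verdicts are not unanimous. The function names, the dictionary layout, and the verdict values are all hypothetical; only the detector names come from the module list.

```python
from typing import Dict, List


def is_challenging(verdicts: Dict[str, bool]) -> bool:
    """Return True when the detectors do not all give the same verdict.

    `verdicts` maps a detector name to True (hallucinated) or False
    (consistent). Non-unanimity is an assumed criterion, not the paper's
    documented rule.
    """
    return len(set(verdicts.values())) > 1


def filter_challenging(samples: List[dict]) -> List[dict]:
    """Keep only the samples on which the detector ensemble disagrees."""
    return [s for s in samples if is_challenging(s["verdicts"])]


# Made-up verdicts for illustration only.
pool = [
    {"id": 1, "verdicts": {"True-NLI": True, "TrueTeacher": True,
                           "HHEM-2.1": True, "GPT-as-judge": True}},
    {"id": 2, "verdicts": {"True-NLI": True, "TrueTeacher": False,
                           "HHEM-2.1": True, "GPT-as-judge": False}},
    {"id": 3, "verdicts": {"True-NLI": False, "TrueTeacher": False,
                           "HHEM-2.1": False, "GPT-as-judge": False}},
]
challenging = filter_challenging(pool)
```

Only sample 2 survives here, because samples 1 and 3 receive unanimous verdicts and would count as 'easy' under this criterion.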

Modeling

Base Model: Benchmark construction paper; evaluates 10 external LLMs

Comparison to Prior Work

vs. Vectara Leaderboard: Adds human annotation and focuses only on hard samples where HHEM/others fail
vs. AggreFact/RAGTruth: Covers 8 modern model families (Gemini, Claude 3.5, Llama 3) vs. older models; introduces 'benign' category
vs. HaluEval [not cited in paper]: FaithBench uses natural outputs from modern LLMs rather than synthetically induced hallucinations via ChatGPT

Limitations

Covers only summarization tasks, not QA or dialogue.
Restricted to short contexts (approx. 137-494 tokens), limiting evaluation of long-context hallucination.
Only covers challenging samples; rankings may not reflect performance on 'easy' general-purpose data.
Low inter-annotator agreement on 'questionable' and 'benign' labels indicates continued subjectivity.
Limited model size diversity within families (mostly smallest versions used).

Reproducibility

Code: https://github.com/vectara/FaithBench

The code is publicly available. The repository contains the benchmark data and evaluation scripts. Annotator guidelines and definitions are provided in the paper.

📊 Experiments & Results

Evaluation Setup

Human evaluation of LLM summaries using a custom interface

Benchmarks:

FaithBench (Summarization Hallucination Detection) [New]

Metrics:

Hallucination Rate ((Unwanted + Questionable) / Total)
Balanced Accuracy (for detectors)
F1-Macro (for detectors)
Krippendorff's alpha (Inter-Annotator Agreement)
Statistical methodology: Krippendorff’s alpha for agreement. No significance tests reported for model rankings.
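The first three metrics above can be written out in plain Python to make their definitions concrete. This is a sketch using the standard textbook definitions, not the paper's evaluation scripts. The function names and the string label values are assumptions, and the hallucination rate is taken to count Unwanted and Questionable samples over all samples.

```python
from collections import Counter
from typing import Hashable, Sequence


def hallucination_rate(labels: Sequence[str]) -> float:
    """Percentage of samples labeled 'unwanted' or 'questionable'."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return 100.0 * (counts["unwanted"] + counts["questionable"]) / len(labels)


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def f1_macro(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen."""
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# A detector that always predicts "hallucinated" (1) on a 3:1 skewed set.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
rate = hallucination_rate(["consistent", "benign", "questionable", "unwanted"])
bal_acc = balanced_accuracy(y_true, y_pred)
macro_f1 = f1_macro(y_true, y_pred)
```

In this toy case the constant detector reaches 75% plain accuracy but only 0.5 balanced accuracy, which is why balanced accuracy near 50% on a binary task is read as close to chance.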

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LLM Hallucination Rankings on FaithBench Challenging Samples
FaithBench	Hallucination Rate (lower is better)	69.70	57.38	-12.32
FaithBench	Hallucination Rate (lower is better)	57.38	63.93	+6.55
Detector Performance on FaithBench (Evaluating the evaluators)
FaithBench	Balanced Accuracy	50.00	58.00	+8.00
FaithBench	F1-Macro	54.00	55.00	+1.00

Experiment Figures

Distribution of sample-level labels (Consistent, Benign, Questionable, Unwanted) per LLM

Main Takeaways

GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations on challenging samples, outperforming newer open models like Llama-3.
Claude-3.5-Sonnet has a high rate of 'benign' hallucinations, suggesting it adds useful but unfaithful information more often than others.
Even state-of-the-art detectors (GPT-4o, HHEM-2.1) struggle significantly with FaithBench, achieving near-random (50-58%) balanced accuracy, which underscores the benchmark's difficulty.
Subjectivity remains a major challenge: Inter-annotator agreement drops significantly when introducing 'benign' and 'questionable' categories compared to binary labeling.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG) failure modes
Familiarity with hallucination detection metrics (NLI, LLM-as-a-judge)
Basic knowledge of inter-annotator agreement metrics

Key Terms

HHEM-2.1-Open: Vectara's open-source hallucination detection model used as a filter for selecting challenging samples

TrueTeacher: A Google factual-consistency model trained on synthetic data labeled by an LLM, used here as one of the hallucination detectors

True-NLI: A Natural Language Inference model used to check if a summary is entailed by the source document

LLM-as-a-judge: Using a strong LLM (like GPT-4) to evaluate the outputs of other models

Benign hallucination: Information not in the source text but supported by world knowledge or reasoning, considered acceptable by readers

Intrinsic hallucination: Generated content that explicitly contradicts the source passage

Extrinsic hallucination: Generated content that is not supported by the passage, not inferable from it, and not factual

Krippendorff’s alpha: A chance-corrected statistical measure of agreement among annotators who code the same set of units; 1 indicates perfect agreement and 0 indicates agreement no better than chance
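For readers unfamiliar with Krippendorff's alpha, the sketch below computes it for nominal labels using the standard coincidence-matrix formulation. It is illustrative only: the paper's own computation (and any distance metric it may use across the four classes) is not specified in this summary, and the function name and toy ratings are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data.

    `units` holds one sequence of ratings per annotated item; annotators who
    skipped an item are simply left out of that item's sequence.
    """
    # Coincidence matrix: each ordered pair of ratings from different
    # annotators on the same unit contributes 1 / (m - 1).
    coincidences: Counter = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # units with a single rating carry no agreement signal
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Toy ratings from two annotators on three summaries (made up).
toy = [
    ["consistent", "consistent"],
    ["unwanted", "unwanted"],
    ["consistent", "unwanted"],
]
alpha = krippendorff_alpha_nominal(toy)
```

Because expected disagreement is computed from the pooled label frequencies, adding rarely used classes such as 'benign' or 'questionable' changes the chance baseline as well as the observed disagreement.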