
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad
Vectara, Inc., Palo Alto, CA; Iowa State University, Ames, IA; University of Southern California, Los Angeles, CA; Entropy Technologies, Melbourne, Australia; University of Waterloo, Waterloo, ON; Funix.io, Iowa City, IA; University of Wisconsin, Madison, WI
arXiv (2024)
Factuality Benchmark

📝 Paper Summary

Hallucination detection, Summarization evaluation
FaithBench is a human-annotated benchmark focusing on challenging hallucinations in summaries generated by 10 diverse modern LLMs, introducing categories for benign and questionable errors.
Core Problem
Existing hallucination benchmarks rely on outdated LLMs, lack model diversity, and use detectors with low accuracy (often <80%), failing to capture the nuance of modern model errors.
Why it matters:
  • Current benchmarks often test easy hallucinations that automatic systems can already catch, missing the subtle errors modern models make.
  • Leaderboards using only one or two model families (like GPT) bias results, ignoring how different architectures (Gemini, Claude, Llama) hallucinate differently.
  • Binary labels (hallucinated vs. faithful) ignore the subjectivity of 'benign' hallucinations that users might actually value (e.g., added reasoning or external facts).
Concrete Example: If a passage says 'water has a smell' and the summary says 'water is odorless', the summary is factually true but unfaithful to the source. Existing benchmarks might label this case inconsistently, whereas FaithBench assigns it a specific category from its granular taxonomy.
Key Novelty
Diverse, Difficult, and Nuanced Hallucination Benchmark
  • Constructs a dataset of 'challenging' samples where popular automated detectors (GPT-4o, HHEM, etc.) disagree, rather than obvious errors.
  • Expands binary labels to a 4-class taxonomy: Consistent, Benign (hallucinated but acceptable), Questionable (subjective), and Unwanted (harmful).
  • Includes summaries from 10 distinct modern LLMs across 8 families (GPT, Llama, Gemini, Mistral, Phi, Claude, Command-R, Qwen) to ensure diversity.
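The first two points above can be illustrated with a minimal sketch. It assumes that "disagree" simply means the detectors' binary verdicts are not unanimous; the paper's actual selection rule may differ. The names `Label`, `detectors_disagree`, and `select_challenging`, as well as the sample records, are hypothetical and exist only for illustration.

```python
from enum import Enum


class Label(Enum):
    """The four annotation classes named in the summary."""
    CONSISTENT = "consistent"      # fully supported by the source
    BENIGN = "benign"              # hallucinated but acceptable
    QUESTIONABLE = "questionable"  # subjective
    UNWANTED = "unwanted"          # harmful


def detectors_disagree(verdicts: dict[str, bool]) -> bool:
    """Return True when the detectors' verdicts are not unanimous.

    `verdicts` maps a detector name to True if that detector flagged
    the summary as hallucinated. Assumption: disagreement means a
    non-unanimous vote, which may not match the paper's criterion.
    """
    return len(set(verdicts.values())) > 1


def select_challenging(samples: list[dict]) -> list[dict]:
    """Keep only the samples on which the detectors disagree."""
    return [s for s in samples if detectors_disagree(s["verdicts"])]


# Made-up records: detector names come from the summary, verdicts do not.
samples = [
    {"id": "s1", "verdicts": {"GPT-4o": True, "HHEM": True}},
    {"id": "s2", "verdicts": {"GPT-4o": True, "HHEM": False}},
    {"id": "s3", "verdicts": {"GPT-4o": False, "HHEM": False}},
]
challenging = select_challenging(samples)
print([s["id"] for s in challenging])
```

Only the sample with a split vote is kept; unanimous cases, whether flagged or not, are treated as too easy under this assumption and dropped before human annotation.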
Evaluation Highlights
  • State-of-the-art hallucination detectors achieve only ~50-58% balanced accuracy on FaithBench, highlighting the difficulty of these samples.
  • GPT-4o produces the fewest hallucinations, followed by GPT-3.5-Turbo and Gemini-1.5-Flash.
  • Claude-3.5-Sonnet produces a notable share (21.31%) of 'benign' hallucinations: content not in the source but acceptable to humans.
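To put the ~50-58% figure in context, balanced accuracy is the mean of the recall on hallucinated samples and the recall on faithful samples, so a detector that always gives the same answer scores exactly 50%. The sketch below uses the standard binary definition; how the four classes are collapsed into two for scoring is not stated in this summary, and the labels in the example are made up.

```python
def balanced_accuracy(y_true: list[bool], y_pred: list[bool]) -> float:
    """Mean of true-positive rate and true-negative rate.

    True means "hallucinated", False means "faithful".
    """
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return 0.5 * (tp / pos + tn / neg)


# Made-up data: 8 hallucinated and 2 faithful summaries.
y_true = [True] * 8 + [False] * 2
always_flag = [True] * 10  # trivial detector that flags everything

plain_accuracy = sum(t == p for t, p in zip(y_true, always_flag)) / len(y_true)
print(plain_accuracy)                          # 0.8
print(balanced_accuracy(y_true, always_flag))  # 0.5
```

The trivial detector looks strong on plain accuracy because of class imbalance but sits at chance level on balanced accuracy, which is why scores in the 50s indicate detectors performing only slightly above a constant guess.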
Breakthrough Assessment
8/10
A significant contribution: the benchmark focuses on 'hard' samples where current detectors fail and introduces necessary nuance (benign vs. unwanted) into evaluation. The low detector accuracy supports the benchmark's utility.