
A Survey of Automatic Hallucination Evaluation on Natural Language Generation

Siya Qi, Yulan He, Zheng Yuan
King’s College London, The Alan Turing Institute, The University of Sheffield
arXiv (2024)
Factuality Benchmark QA RAG KG

📝 Paper Summary

Hallucination Evaluation, Factuality Benchmarks, Faithfulness Metrics
This survey analyzes 105 hallucination evaluation methods, proposing a taxonomy that distinguishes between Source Faithfulness and World Factuality to organize the shift from pre-LLM to post-LLM approaches.
Core Problem
The field of Automatic Hallucination Evaluation (AHE) is fragmented: definitions of faithfulness and factuality are conflated, which limits progress as methods shift from task-specific models to general-purpose LLMs.
Why it matters:
  • LLMs frequently generate fluent but factually incorrect content, posing safety risks in real-world deployment
  • Existing terms like 'hallucination' are often used ambiguously to cover both contradictions of source input and contradictions of world knowledge
  • Current evaluation methods lack a unified framework, making it difficult to compare progress across different eras of model development (pre-LLM vs. post-LLM)
Concrete Example: A model might generate 'Taylor Swift's album 1988 topped sales'. If the source text says '1989', this is a Source Faithfulness Error. If the source text actually contained the typo '1988' and the model copied it, the output is Faithful but contains a World Factual Error. Current evaluators often fail to distinguish these cases.
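The two cases in this example can be sketched in a few lines of Python. This is only an illustration of the distinction, not an evaluator from the survey: the `Verdict` type and the `judge` function are hypothetical names, and the sketch assumes the relevant value (here, the album title) has already been extracted from the output, the source, and a trusted knowledge reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Two independent labels for a single generated claim (hypothetical type)."""
    source_faithful: bool  # does the output agree with the source input?
    world_factual: bool    # does the output agree with real-world knowledge?


def judge(output_value: str, source_value: str, world_value: str) -> Verdict:
    # Assumption: values are pre-extracted and compared by exact string match.
    # A real evaluator would need claim extraction and semantic comparison.
    return Verdict(
        source_faithful=(output_value == source_value),
        world_factual=(output_value == world_value),
    )


# Case 1: the source says "1989" but the model writes "1988".
case1 = judge("1988", source_value="1989", world_value="1989")

# Case 2: the source itself contains the typo "1988" and the model copies it.
case2 = judge("1988", source_value="1988", world_value="1989")

print(case1)
print(case2)
```

Keeping the two labels separate is the point: in the second case the output is faithful to its source yet still wrong about the world, which a single "hallucination" flag cannot express.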
Key Novelty
Unified Source Faithfulness (SF) vs. World Factuality (WF) Taxonomy
  • Distinguishes between Source Faithfulness (consistency with input) and World Factuality (alignment with real-world knowledge) to resolve terminological ambiguity
  • Classifies 105 evaluation methods into three paradigms: Reference-based (checking against evidence), Reference-free (checking internal consistency/uncertainty), and LLM-based (using models as judges)
  • Systematically compares pre-LLM task-specific methods (summarization, translation) with post-LLM generalist methods (instruction following, reasoning)
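One way to picture the taxonomy is as a small record type with one field per axis. The sketch below is an assumption about how such a catalog could be represented, not the survey's own data format: the class names, the `count_by` helper, and the three placeholder entries (`evaluator_a`, `evaluator_b`, `evaluator_c`) are all hypothetical and do not correspond to real methods.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    SOURCE_FAITHFULNESS = "SF"  # consistency with the input
    WORLD_FACTUALITY = "WF"     # alignment with real-world knowledge


class Paradigm(Enum):
    REFERENCE_BASED = "reference-based"  # checks against evidence
    REFERENCE_FREE = "reference-free"    # internal consistency / uncertainty
    LLM_BASED = "LLM-based"              # models used as judges


class Era(Enum):
    PRE_LLM = "pre-LLM"
    POST_LLM = "post-LLM"


@dataclass(frozen=True)
class EvaluatorEntry:
    name: str
    target: Target
    paradigm: Paradigm
    era: Era


# Placeholder entries for illustration only; not real evaluators.
catalog = [
    EvaluatorEntry("evaluator_a", Target.SOURCE_FAITHFULNESS,
                   Paradigm.REFERENCE_BASED, Era.PRE_LLM),
    EvaluatorEntry("evaluator_b", Target.WORLD_FACTUALITY,
                   Paradigm.REFERENCE_FREE, Era.POST_LLM),
    EvaluatorEntry("evaluator_c", Target.WORLD_FACTUALITY,
                   Paradigm.LLM_BASED, Era.POST_LLM),
]


def count_by(entries, attr):
    """Count catalog entries along one axis of the taxonomy."""
    return Counter(getattr(entry, attr) for entry in entries)


era_counts = count_by(catalog, "era")
paradigm_counts = count_by(catalog, "paradigm")
```

Treating target, paradigm, and era as separate fields keeps the axes independent, so a method can be compared along any one of them without implying anything about the others.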
Evaluation Highlights
  • Analyzes 105 evaluation methods and finds that 77.1% specifically target LLMs, marking a paradigm shift away from earlier task-specific metrics
  • Finds that 46.7% of the surveyed evaluators release their own benchmarks, indicating fragmented testing standards
  • Catalogs over 50 datasets across categories such as General Factuality (e.g., TruthfulQA), Application-specific (e.g., MedHalt), and Meta-evaluation (e.g., SummEval)
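As a rough sanity check, the percentages can be converted back into approximate method counts. The counts below are inferred by rounding and are not stated in the summary, so they should be read as an assumption; `implied_count` is a hypothetical helper.

```python
TOTAL_METHODS = 105  # total number of surveyed evaluation methods


def implied_count(percent: float, total: int = TOTAL_METHODS) -> int:
    """Nearest whole number of methods consistent with a reported percentage."""
    return round(percent / 100 * total)


# Inferred, not reported: counts implied by the two percentages.
llm_targeted = implied_count(77.1)
own_benchmark = implied_count(46.7)

# Round-trip check: the inferred counts reproduce the reported percentages
# when rounded to one decimal place.
llm_share = round(llm_targeted / TOTAL_METHODS * 100, 1)
benchmark_share = round(own_benchmark / TOTAL_METHODS * 100, 1)

print(llm_targeted, own_benchmark, llm_share, benchmark_share)
```

Under this rounding assumption, roughly 81 of the 105 methods target LLMs and roughly 49 ship their own benchmark.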
Breakthrough Assessment
9/10
A comprehensive, structured survey that provides much-needed clarity on definitions (SF vs. WF) and organizes a chaotic field. Essential reading for understanding the landscape of hallucination evaluation.