An Audit on the Perspectives and Challenges of Hallucinations in NLP

📝 Paper Summary

Definition and Taxonomy of Hallucination Metrics and Evaluation

An audit of 103 NLP papers and a survey of 171 practitioners reveal a lack of consensus on definitions and metrics for LLM hallucinations, with significant divergence between academic frameworks and real-world perceptions.

Core Problem

The term 'hallucination' is used inconsistently across NLP research without a unified definition or measurement framework, leading to fragmented understanding and mismatched mitigation strategies.

Why it matters:

57.3% of audited papers discussing hallucination do not even define the term, creating ambiguity in research goals
The lack of consensus risks misappropriating the term across diverse contexts like image captioning vs. text generation
Practitioners and researchers disagree on terminology, with many preferring terms like 'fabrication' or 'confabulation' to avoid anthropomorphism

Concrete Example: One paper defines hallucination as 'nonsensical' output, while another defines it as 'plausible but unfaithful'. A practitioner might view invented content in creative story generation as a feature, while a medical researcher views the same kind of 'hallucination' as a critical failure.

Key Novelty

Dual-Method Audit of Hallucination Conceptualization

Systematically audits 103 peer-reviewed NLP publications to categorize how hallucination is defined (or not) and measured (statistical vs. data-driven vs. human)
Conducts a survey of 171 NLP/AI practitioners to contrast academic definitions with real-world perceptions, revealing a preference for terms like 'fabrication' and recognizing creative uses of hallucination

Evaluation Highlights

Only 42.7% of the 103 audited papers explicitly define 'hallucination', with the majority (57.3%) providing no definition despite focusing on the topic
40.46% of surveyed practitioners prefer the term 'Fabrication' over 'Hallucination' to describe the phenomenon, citing the latter's improper anthropomorphism
92% of survey respondents view hallucination as a weakness, yet ~12% identify positive correlations with creativity in tasks like storytelling

Breakthrough Assessment

7/10

Provides critical meta-analysis rather than a new model. Highlights significant methodological flaws in the field (undefined terms, inconsistent metrics) that hinder progress.

⚙️ Technical Details

Problem Definition

Setting: Meta-analysis and survey research

Inputs: 103 peer-reviewed NLP papers; 171 survey responses from NLP/AI practitioners

Outputs: Taxonomy of hallucination definitions, audit of measurement metrics, and analysis of practitioner sentiment

Pipeline Flow

Literature Search (ACL Anthology)
Thematic Analysis (Paper Audit)
Practitioner Survey Design
Survey Distribution & Analysis

System Modules

Literature Search (Data Collection)

Identify relevant papers using keywords like 'hallucination', 'fabrication', 'confabulation'

Model or implementation: Search over ACL Anthology
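A minimal sketch of such a keyword filter, assuming paper metadata has already been exported as records with `title` and `abstract` fields. The record layout, the stem-based matching rule, and the toy records are illustrative assumptions; the summary does not describe how the authors actually queried the ACL Anthology.

```python
import re

# Stems of the three keywords named above. Matching on stems (so that
# "hallucinated" or "confabulate" also count) is an assumption of this sketch.
KEYWORD_STEMS = ("hallucinat", "fabricat", "confabulat")
PATTERN = re.compile(r"\b(?:" + "|".join(KEYWORD_STEMS) + r")\w*", re.IGNORECASE)


def matches_keywords(record: dict) -> bool:
    """Return True if the title or abstract mentions any keyword stem."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}"
    return PATTERN.search(text) is not None


def filter_papers(records: list) -> list:
    """Keep only the records that match at least one keyword stem."""
    return [r for r in records if matches_keywords(r)]


# Toy records, invented purely for illustration.
papers = [
    {"title": "Reducing Hallucinations in Summarization", "abstract": ""},
    {"title": "A Parser for Low-Resource Languages", "abstract": "We study parsing."},
    {"title": "Dialogue Systems", "abstract": "Models sometimes confabulate facts."},
]

selected = filter_papers(papers)
print([p["title"] for p in selected])
```

A filter like this only produces candidates; deciding whether a paper is actually about hallucination still requires manual review.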

Thematic Analysis

Categorize definitions, metrics, and tasks in the collected papers

Model or implementation: Iterative thematic analysis
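Once each paper has been coded by hand, the audit statistics reduce to simple tallies. The sketch below assumes a hypothetical coding schema (`defines_term`, `metric_types`) and uses invented records; neither the schema nor the data comes from the paper, and only the three metric categories (statistical, data-driven, human) are taken from the summary.

```python
from collections import Counter

# Hypothetical coded records: one entry per audited paper.
coded_papers = [
    {"id": "paper-1", "defines_term": True, "metric_types": ["statistical"]},
    {"id": "paper-2", "defines_term": False, "metric_types": ["statistical", "human"]},
    {"id": "paper-3", "defines_term": False, "metric_types": ["data-driven"]},
    {"id": "paper-4", "defines_term": True, "metric_types": []},
]


def tally(coded: list) -> dict:
    """Summarize manual codes: share of papers defining the term, metric usage."""
    n = len(coded)
    defined = sum(1 for p in coded if p["defines_term"])
    # Count how many papers use each metric category.
    metric_counts = Counter(m for p in coded for m in set(p["metric_types"]))
    # A paper is "mixed" here if it uses more than one metric category.
    mixed = sum(1 for p in coded if len(set(p["metric_types"])) > 1)
    return {
        "n": n,
        "share_defining": defined / n,
        "metric_counts": dict(metric_counts),
        "mixed": mixed,
    }


summary = tally(coded_papers)
print(summary)
```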

Practitioner Survey (Data Collection)

Gather perceptions from researchers and industry practitioners

Model or implementation: Survey (14 open- and closed-ended questions)

Novel Architectural Elements

None (This is a survey/audit paper, not a system architecture paper)

Comparison to Prior Work

vs. Ji et al.: This work critically audits the *lack* of definitions rather than just summarizing existing ones
vs. Zhang et al.: Incorporates a practitioner survey to contrast academic theory with real-world usage
vs. Huang et al.: Focuses on the sociotechnical gap and the inconsistency of metrics used across the field

Limitations

Survey sample (171 respondents) may not fully represent the global NLP community
Analysis is limited to papers available in the ACL Anthology up to April 2024
Does not propose a new metric or model to fix the identified problems, only highlights them

Reproducibility

Code: https://github.com/PranavNV/The-Thing-Called-Hallucination

The list of 103 audited papers and the survey questions are available at https://github.com/PranavNV/The-Thing-Called-Hallucination.

📊 Experiments & Results

Evaluation Setup

Audit of 103 papers and analysis of 171 survey responses

Benchmarks:

Literature Corpus (Meta-analysis) [New]
Practitioner Survey (Opinion Survey) [New]

Metrics:

Percentage of papers defining hallucination
Percentage of papers acknowledging sociotechnical factors
Prevalence of specific metrics (Statistical, Data-driven, Human)
Practitioner familiarity and sentiment
Statistical methodology: Descriptive statistics for closed-ended survey questions; thematic analysis for open-ended responses
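The descriptive statistics reported in this summary are plain proportions, so they can be sanity-checked by back-calculating which integer counts are consistent with a reported percentage. The sketch below is illustrative only: the helper names are hypothetical, and the recovered counts are derived by arithmetic from the percentages and totals quoted in this summary, not stated in it.

```python
def percentage(count: int, total: int, decimals: int = 1) -> float:
    """Share of `count` in `total`, as a percentage rounded to `decimals`."""
    return round(100 * count / total, decimals)


def consistent_counts(reported_pct: float, total: int, decimals: int = 1) -> list[int]:
    """All integer counts out of `total` that round to the reported percentage."""
    return [
        k for k in range(total + 1)
        if abs(percentage(k, total, decimals) - reported_pct) < 1e-9
    ]


PAPERS = 103       # audited papers, from the summary
RESPONDENTS = 171  # survey responses, from the summary

defining = consistent_counts(42.7, PAPERS)        # papers that define the term
not_defining = consistent_counts(57.3, PAPERS)    # papers that do not
sociotechnical = consistent_counts(2.9, PAPERS)   # papers noting sociotechnical factors
fabrication = consistent_counts(40.46, RESPONDENTS, decimals=2)

print(defining, not_defining, sociotechnical, fabrication)
```

An empty result for a given percentage only means that it cannot be a share of the full total at the stated precision, for example because not every respondent answered that question; the summary does not say which denominator was used.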

Key Results

Audit of the literature reveals a significant lack of clarity and consensus in defining hallucination.

Benchmark	Metric	Baseline	This Paper	Δ
Literature Corpus	Percentage defining 'hallucination'	N/A	42.7%	N/A
Literature Corpus	Percentage acknowledging sociotechnical nature	N/A	2.9%	N/A

Audit of metrics shows a fragmented landscape with heavy reliance on statistical proxies.

Benchmark	Metric	Baseline	This Paper	Δ
Literature Corpus	Use of Statistical Metrics	N/A	35.2%	N/A
Literature Corpus	Use of Mixed Methodologies	N/A	28.4%	N/A

Practitioner survey reveals strong preference for alternative terminology.

Benchmark	Metric	Baseline	This Paper	Δ
Practitioner Survey	Preference for 'Fabrication'	N/A	40.46%	N/A
Practitioner Survey	Perception as Weakness	N/A	92%	N/A

Main Takeaways

The field suffers from a lack of shared vocabulary; over half of the papers discussing hallucination fail to define it.
There is a disconnect between the term 'hallucination' (anthropomorphic) and technical reality, with many practitioners preferring 'fabrication' or 'confabulation'.
Measurement is highly inconsistent: the 103 audited papers use 25 different statistical metrics and 18 distinct datasets.
A minority of practitioners (~12%) view hallucination positively as a driver of creativity, a perspective largely absent from the audited literature.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and their evaluation
Understanding of common NLP metrics (BLEU, ROUGE, etc.)

Key Terms

Hallucination: In this context, broadly refers to model generation of non-existent objects, unfaithful information, or factual errors, though definitions vary widely.

Intrinsic Hallucination: Generations that contradict the source input provided to the model.

Extrinsic Hallucination: Generations that contain information not present in the source input (which may or may not be factually true in the real world).

Sociotechnical: An approach that considers the interaction between social systems (people, institutions) and technical systems (algorithms, models).

Fabrication: An alternative term preferred by some practitioners, implying the creation of false information without the sensory implication of 'hallucination'.

Confabulation: An alternative term referring to the creation of false memories or information without the intent to deceive.

CHAIR: Caption Hallucination Assessment with Image Relevance, a metric for evaluating hallucinations in image captioning.

Faithfulness: The degree to which the generated output accurately reflects the information in the source input.

Factuality: The degree to which the generated output aligns with real-world facts.
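To make the CHAIR entry above concrete, the sketch below computes its two commonly reported variants: the per-instance score (hallucinated object mentions divided by all object mentions) and the per-sentence score (captions containing at least one hallucinated object divided by all captions). It is a simplified illustration that assumes object mentions have already been extracted from each caption and mapped to the same vocabulary as the ground-truth annotations; the function name and the toy data are hypothetical.

```python
def chair_scores(captions: list) -> tuple:
    """Compute simplified CHAIR scores.

    `captions` is a list of (mentioned_objects, ground_truth_objects) pairs,
    each element being a set of object names for one image caption.
    Returns (chair_i, chair_s).
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in captions:
        # An object is hallucinated if it is mentioned but not in the image.
        hallucinated = mentioned - present
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(captions) if captions else 0.0
    return chair_i, chair_s


# Toy examples, invented for illustration.
examples = [
    ({"dog", "frisbee"}, {"dog", "frisbee", "person"}),  # nothing hallucinated
    ({"cat", "sofa", "laptop"}, {"cat", "sofa"}),        # "laptop" is hallucinated
]

chair_i, chair_s = chair_scores(examples)
print(chair_i, chair_s)  # 1 of 5 mentions, 1 of 2 captions
```

Lower values are better for both scores. Note that an object omitted from a caption (such as "person" in the first example) does not count against the model, since CHAIR measures only what is mentioned but absent.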