On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?

📝 Paper Summary

Knowledge-grounded dialogue
Dataset auditing

Auditing standard dialogue benchmarks reveals that more than 60% of their responses are hallucinated, and that models trained on them amplify this hallucination rather than mitigate it.

Core Problem

Knowledge-grounded conversational models frequently generate hallucinations, but it is unclear whether this stems primarily from the models themselves or the underlying training data.

Why it matters:

Existing datasets are crowdsourced with loose incentives, encouraging workers to ignore knowledge snippets and use personal opinions or external knowledge.
Neural models trained on noisy data may not only replicate but amplify hallucination behavior at inference time.
Improving models to be faithful is futile if the ground-truth benchmarks themselves are fraught with unverifiable content.

Concrete Example: In the Wizard of Wikipedia dataset, a 'wizard' ignores the provided text about science fiction's consequences and instead claims the show 'Fringe' is 'incredibly well written'—a subjective opinion not found in the source text. A model trained on this learns to hallucinate subjective content.

Key Novelty

Comprehensive Dataset Audit & Amplification Analysis

Conducts a large-scale human annotation of three major benchmarks (Wizard of Wikipedia, CMU-DoG, TopicalChat) using the BEGIN taxonomy to quantify hallucination levels.
Analyzes the 'Verbal Response Modes' (VRM) to categorize hallucinations linguistically (e.g., Disclosure of opinion vs. Edification of facts).
Measures 'hallucination amplification' by comparing the hallucination rates of trained SOTA models against the hallucination rates of the gold training data.
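
The comparison described above can be sketched as a tally over per-response BEGIN labels. This is a minimal illustration, not the authors' code: the label strings, the function names, and the toy annotations are all assumptions made for the example.

```python
from collections import Counter

# BEGIN categories as named in this summary; the exact label strings are assumptions.
BEGIN_LABELS = ("entailment", "hallucination", "generic", "uncooperative")


def label_distribution(labels):
    """Return the percentage of responses assigned to each BEGIN category."""
    if not labels:
        raise ValueError("labels must not be empty")
    unknown = set(labels) - set(BEGIN_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {name: 100.0 * counts[name] / len(labels) for name in BEGIN_LABELS}


def amplification(gold_labels, model_labels):
    """Model hallucination rate minus gold hallucination rate, in percentage points."""
    gold_rate = label_distribution(gold_labels)["hallucination"]
    model_rate = label_distribution(model_labels)["hallucination"]
    return model_rate - gold_rate


# Toy annotations invented for illustration; they are not data from the paper.
gold = ["hallucination"] * 6 + ["entailment"] * 3 + ["generic"]
model = ["hallucination"] * 8 + ["entailment"] + ["uncooperative"]

print(label_distribution(gold)["hallucination"])  # 60.0
print(amplification(gold, model))                 # 20.0
```

A positive value means the model hallucinates more often than the data it was trained on; a negative value means it hallucinates less.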

Architecture

[Figure: An example of a hallucinated conversation from the Wizard of Wikipedia dataset.]

Evaluation Highlights

Over 60% of responses in standard benchmarks (WoW, CMU-DoG, TopicalChat) are hallucinated (not supported by the source text).
State-of-the-art models amplify hallucination: GPT2 trained on WoW increases fully hallucinated responses by 19.2 percentage points compared to the gold data.
Subjective 'Disclosure' (opinions/feelings) is the primary mode of hallucination, accounting for >50% of hallucinations in all three datasets.

Breakthrough Assessment

9/10

A seminal audit paper that fundamentally challenged the reliability of standard benchmarks. It shifted the field's focus from purely architectural fixes to data quality and faithfulness.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-grounded dialogue generation where a response must be inferred exclusively from a provided knowledge snippet.

Inputs: Dialogue history and a relevant knowledge snippet.

Outputs: A natural language response.

Pipeline Flow

Input (Knowledge + History)
Model Processing (GPT2 / DoHA / CTRL)
Response Generation (Nucleus Sampling)

System Modules

Base Models

Generate conversational responses conditioned on knowledge

Model or implementation: GPT2, DoHA (BART-based), CTRL

Novel Architectural Elements

This is an analysis paper; it does not propose a new architecture but audits existing ones (GPT2, DoHA, CTRL).

Modeling

Base Model: GPT2, BART (for DoHA), CTRL

Training Method: Fine-tuning on specific dialogue datasets (Wizard of Wikipedia, CMU-DoG, TopicalChat)

Objective Functions:

Purpose: Standard Language Modeling.

Formally: Maximum Likelihood Estimation (MLE)
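
Written out, the MLE objective for a knowledge-grounded response typically takes the form below. The notation is an assumption introduced here, not taken from the paper: $\mathcal{K}$ is the knowledge snippet, $\mathcal{H}$ the dialogue history, $y_1, \dots, y_T$ the response tokens, and $\theta$ the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, \mathcal{K}, \mathcal{H}\right)
```

Minimizing this negative log-likelihood rewards the model for reproducing the gold response token by token, which is one way to see why hallucinated gold responses would be learned rather than corrected.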

Training Data:

Wizard of Wikipedia (WoW): 82,722 train samples
CMU-DoG: 78,136 train samples
TopicalChat: 292,215 train samples

Key Hyperparameters:

learning_rate: 6.25e-5
batch_size: 32
optimizer: Adam
decoding_p: 0.6 (Nucleus Sampling)

Compute: Nvidia V100 GPU 32GB; Training time 1.5-10 hours depending on dataset/model

Comparison to Prior Work

vs. GPT2/DoHA: This paper analyzes them as subjects of the audit rather than proposing them.
vs. CTRL: Shows that even models designed for control (CTRL) can fail to reduce hallucinations or trade faithfulness for coherence (becoming extractive/uncooperative).

Limitations

Analysis limited to English benchmarks only.
Does not investigate root causes of hallucination within the model architectures (e.g., specific attention failures), focusing instead on data correlation.
The 'Uncooperative' category in evaluation includes extractive responses that are faithful but conversationally poor, complicating the faithfulness vs. quality trade-off.

Reproducibility

Code: https://github.com/McGill-NLP/FaithDial

Annotations are publicly available (https://github.com/McGill-NLP/FaithDial). Code for DoHA and GPT2 is based on standard libraries (HuggingFace, PyTorch Lightning). CTRL was implemented by the authors based on discussions with its original creators.

📊 Experiments & Results

Evaluation Setup

Human annotation of gold training data and model-generated responses using BEGIN and VRM taxonomies.

Benchmarks:

Wizard of Wikipedia (WoW) (Knowledge-grounded dialogue)
CMU-DoG (Movie domain document-grounded dialogue)
TopicalChat (Open-domain knowledge-grounded dialogue)

Metrics:

Hallucination Rate (%)
Entailment Rate (%)
VRM Distribution (Disclosure, Edification, etc.)
Fleiss' Kappa (Inter-annotator agreement)

Statistical methodology: Fleiss' Kappa is reported for inter-annotator agreement (e.g., 0.89 on WoW BEGIN). Expert annotation on 200 samples; AMT annotation on 4000 WoW samples.
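
For readers unfamiliar with the agreement statistic, here is a minimal sketch of the standard Fleiss' Kappa formula. The function name, the input layout, and the toy ratings are assumptions for illustration; the paper's own annotation tooling is not shown in this summary.

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a table where ratings[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("ratings must not be empty")
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(ratings[0])

    # Observed agreement: fraction of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(s * s for s in shares)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy table: 4 responses, 3 annotators, 2 categories (hallucinated / not).
toy = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(round(fleiss_kappa(toy), 3))  # 0.625
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported 0.89 is close to the top of the scale.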

Key Results

Audit of human-written 'gold' data reveals high levels of hallucination across all three benchmarks.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
Wizard of Wikipedia	Overall Hallucination Rate (Gold Data)	0	62.0	+62.0
CMU-DoG	Overall Hallucination Rate (Gold Data)	0	66.5	+66.5
TopicalChat	Overall Hallucination Rate (Gold Data)	0	63.9	+63.9

Model evaluation shows amplification of hallucination compared to the training data.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
Wizard of Wikipedia	Overall Hallucination Rate	62.0	81.2	+19.2
CMU-DoG	Overall Hallucination Rate	66.5	81.5	+15.0
Wizard of Wikipedia	Overall Hallucination Rate	62.0	36.0	-26.0
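
The Δ column for the model rows is an absolute difference between two percentages, which is easy to confuse with a relative increase. The short sketch below recomputes both from the table values; the helper names are hypothetical and the relative figures are derived here, not reported in the source.

```python
def percentage_point_change(baseline_rate, model_rate):
    """Absolute difference between two rates that are already in percent."""
    return round(model_rate - baseline_rate, 1)


def relative_change(baseline_rate, model_rate):
    """Change expressed as a percentage of the baseline rate."""
    return round(100.0 * (model_rate - baseline_rate) / baseline_rate, 1)


# (benchmark, baseline rate, model rate) copied from the model rows of the table.
rows = [
    ("Wizard of Wikipedia", 62.0, 81.2),
    ("CMU-DoG", 66.5, 81.5),
    ("Wizard of Wikipedia", 62.0, 36.0),
]

for benchmark, baseline, model in rows:
    pp = percentage_point_change(baseline, model)
    rel = relative_change(baseline, model)
    print(f"{benchmark}: {pp:+.1f} points, {rel:+.1f}% relative to baseline")
```

For the first row, the 19.2-point gap corresponds to a relative increase of about 31% over the gold rate.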

Experiment Figures

Sunburst charts showing the breakdown of BEGIN categories (Hallucination, Entailment, etc.) and VRM types for Wizard of Wikipedia.

Main Takeaways

Standard benchmarks are >60% hallucinated because human crowdworkers prioritize 'informativeness' and personal opinions (Disclosure) over faithfulness to the snippet.
Neural models amplify this behavior: GPT2 increases hallucination rates by roughly 15-19 percentage points over the gold data, suggesting that hallucination patterns are easier to learn than entailment.
CTRL reduces hallucination but fails to generate natural dialogue, instead reverting to 'Uncooperative' behavior (copy-pasting text without conversational coherence).
The primary linguistic strategy for hallucination is 'Disclosure' (subjective feelings/opinions), accounting for >50% of hallucinations in gold data.

📚 Prerequisite Knowledge

Prerequisites

Understanding of knowledge-grounded dialogue tasks
Familiarity with hallucination in text generation

Key Terms

hallucination: A response containing information that cannot be fully verified from the source knowledge snippet (even if true in the real world).

entailment: A response fully supported by the knowledge snippet; all information is attributable to the source.

VRM: Verbal Response Modes—a taxonomy for classifying speech acts (e.g., Disclosure, Edification, Advisement).

BEGIN: Benchmark for Evaluation of Grounded Interaction—a taxonomy used to classify response groundedness (Entailment, Hallucination, Generic, Uncooperative).

disclosure: A VRM category where the speaker reveals subjective opinions, thoughts, feelings, or personal experiences.

edification: A VRM category concerning objective information.

uncooperative: A response that is entailed by the source but does not follow conversational principles (e.g., incoherent with history, purely extractive).

nucleus sampling: A decoding strategy that samples from the smallest set of top vocabulary tokens whose cumulative probability exceeds a threshold p.
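
The nucleus sampling definition above can be sketched in plain Python over a toy next-token distribution. The function names, the example vocabulary, and its probabilities are assumptions for illustration; a real decoder would apply the same filter to model logits at every step. This sketch keeps tokens until the cumulative probability reaches p (greater than or equal), a common convention.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, then renormalise so the kept set sums to 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def nucleus_sample(probs, p, rng=random):
    """Draw one token from the renormalised nucleus."""
    nucleus = nucleus_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distribution, invented for illustration.
next_token_probs = {"the": 0.5, "a": 0.2, "movie": 0.15, "banana": 0.1, "zebra": 0.05}

print(sorted(nucleus_filter(next_token_probs, p=0.6)))  # ['a', 'the']
print(nucleus_sample(next_token_probs, p=0.6, rng=random.Random(0)))
```

With p = 0.6, as in the hyperparameters listed earlier, only the two most likely tokens survive here, so low-probability continuations are never sampled.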