DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

📝 Paper Summary

Hallucination Detection, Dialogue Systems

DiaHalu is the first dedicated benchmark for evaluating both factuality and faithfulness hallucinations in multi-turn dialogues across diverse domains, generated naturally by LLMs and manually annotated.

Core Problem

Existing hallucination benchmarks typically focus on non-interactive text at the sentence or passage level, rely on artificially induced rather than naturally generated hallucinations, and often ignore faithfulness issues such as incoherence or self-contradiction.

Why it matters:

LLMs are widely used in dialogue (chatbots), where unique hallucination types like self-contradiction across turns are critical but under-evaluated
Most current benchmarks only check if facts align with the world (factuality), neglecting whether the model aligns with user instructions or its own context (faithfulness)
Artificial triggers in existing datasets do not reflect the natural distribution of errors LLMs make in real-world daily usage

Concrete Example: In a task-oriented dialogue about booking a train, an LLM might initially say no trains are available, but in the next turn offer a specific train time (faithfulness/consistency hallucination). Current factuality benchmarks would miss this context-conflicting error because they treat sentences in isolation.

Key Novelty

Natural Multi-Turn Hallucination Benchmark

Simulates natural human-machine interaction by having two LLMs converse (with human alignment for one role) to generate authentic multi-turn contexts
Expands hallucination taxonomy beyond factuality to include faithfulness subtypes: Incoherence, Irrelevance, Overreliance, and Reasoning Error
Provides granular annotation at the dialogue level, identifying exactly which turn and subtype exhibits the hallucination

Architecture

The data construction pipeline for DiaHalu

Evaluation Highlights

Hallucination rates are notably high in knowledge-grounded (32.8%) and reasoning (35.2%) dialogues compared to chit-chat (12.4%)
Faithfulness hallucinations (incoherence, irrelevance) constitute a significant portion of errors in task-oriented and chit-chat scenarios, often dominating factuality errors
Human annotation achieved an Inter-Annotator Agreement (Fleiss’s Kappa) of 0.8842, indicating high reliability of the dataset labels

Breakthrough Assessment

8/10

Significant contribution as the first dedicated dialogue-level hallucination benchmark covering diverse domains and faithfulness subtypes, filling a clear gap in the current evaluation landscape.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn dialogue generation and hallucination detection

Inputs: A multi-turn dialogue context between a User and a System

Outputs: Binary labels (Hallucination/Not), specific subtypes (e.g., Non-factual, Incoherence), and span locations
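
One way to picture the input/output contract is as a small record type. The sketch below is illustrative only: the subtype names come from this summary, while the class names, field names, validation rule, and the invented wording of the train-booking turns are assumptions, not the released dataset schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Subtype names are taken from this summary; everything else here
# (class names, field names, the validation rule) is an illustrative assumption.
SUBTYPES = frozenset(
    {"Non-factual", "Incoherence", "Irrelevance", "Overreliance", "Reasoning Error"}
)


@dataclass
class Turn:
    speaker: str  # "User" or "System"
    text: str


@dataclass
class AnnotatedDialogue:
    domain: str
    turns: List[Turn]
    hallucinated: bool                 # binary dialogue-level label
    subtype: Optional[str] = None      # set only when hallucinated is True
    turn_index: Optional[int] = None   # 0-based position of the offending turn
    explanation: str = ""

    def __post_init__(self) -> None:
        if self.hallucinated:
            if self.subtype not in SUBTYPES:
                raise ValueError(f"unknown subtype: {self.subtype!r}")
            if self.turn_index is None or not 0 <= self.turn_index < len(self.turns):
                raise ValueError("turn_index must point at an existing turn")
        elif self.subtype is not None or self.turn_index is not None:
            raise ValueError("a clean dialogue carries no subtype or turn_index")


# Invented wording for the train-booking example described earlier;
# labelling it "Incoherence" is an assumption about how it would be annotated.
example = AnnotatedDialogue(
    domain="task-oriented",
    turns=[
        Turn("User", "Is there a train to Cambridge on Friday morning?"),
        Turn("System", "Sorry, no trains are available on Friday."),
        Turn("User", "Okay, what else can I do?"),
        Turn("System", "You could take the 09:15 train on Friday."),
    ],
    hallucinated=True,
    subtype="Incoherence",
    turn_index=3,
    explanation="The last System turn contradicts the earlier 'no trains' reply.",
)
```

Keeping the label, subtype, and location in one record makes it easy to check that a detector flags the right turn, not just the right dialogue.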

Pipeline Flow

Topic Collection: Gather topics from existing datasets (TruthfulQA, MultiWOZ, etc.) and GPT-4
Dialogue Generation: Use ChatGPT/GPT-4 to simulate self-dialogues based on system prompts
Human Alignment: Manually edit 'User' turns in task/knowledge domains to ensure naturalness, then regenerate 'System' responses
Annotation: Expert annotation of hallucinations, subtypes, and explanations
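
The generation and alignment steps above can be sketched roughly as follows. This is a minimal sketch, not the authors' code: `self_play`, `align`, the `Generate` callable signature, and the two stand-in functions are all hypothetical, and no real LLM API is called. Per the summary, the alignment step applies to the task-oriented and knowledge-grounded domains.

```python
from typing import Callable, List, Tuple

Dialogue = List[Tuple[str, str]]                # (speaker, text) pairs
Generate = Callable[[str, str, Dialogue], str]  # (role, topic, history) -> utterance


def self_play(topic: str, generate: Generate, n_rounds: int = 2) -> Dialogue:
    """Draft a dialogue by letting one LLM callable play both roles in turn."""
    dialogue: Dialogue = []
    for _ in range(n_rounds):
        for role in ("User", "System"):
            dialogue.append((role, generate(role, topic, dialogue)))
    return dialogue


def align(
    dialogue: Dialogue,
    topic: str,
    edit_user: Callable[[str], str],
    generate: Generate,
) -> Dialogue:
    """Rewrite each User turn, then regenerate the System reply that follows it."""
    aligned: Dialogue = []
    for speaker, text in dialogue:
        if speaker == "User":
            aligned.append(("User", edit_user(text)))
        else:
            # The old System text is discarded; the reply is regenerated
            # against the edited history so it stays naturally produced.
            aligned.append(("System", generate("System", topic, aligned)))
    return aligned


# Stand-ins so the sketch runs offline. A real setup would call an LLM
# inside `generate` and route `edit_user` through a human editor.
def fake_generate(role: str, topic: str, history: Dialogue) -> str:
    return f"{role} utterance {len(history)} about {topic}"


def fake_edit(text: str) -> str:
    return text.replace("utterance", "question")


draft = self_play("train booking", fake_generate)
final = align(draft, "train booking", fake_edit, fake_generate)
```

Passing the generator and editor in as callables keeps the control flow testable without network access; only the System side is ever regenerated, which mirrors the idea of leaving that side's hallucinations untouched.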

System Modules

Topic Collector (Data Construction)

Aggregate diverse topics for the four dialogue domains (knowledge-grounded, task-oriented, chit-chat, and reasoning)

Model or implementation: N/A (Aggregation)

Dialogue Generator (Data Construction)

Generate initial multi-turn conversations

Model or implementation: ChatGPT (GPT-3.5) or GPT-4

Human Aligner (Data Construction)

Refine dialogue to match human language patterns

Model or implementation: Human Annotators + LLM regeneration

Novel Architectural Elements

Hybrid generation pipeline: Self-play between LLMs followed by human intervention on one side (the 'User' side) to enforce realistic human-machine interaction patterns while keeping the 'System' side naturally hallucinated

Modeling

Base Model: ChatGPT (GPT-3.5) and GPT-4 (used for dataset generation)

Comparison to Prior Work

vs. HaluEval: DiaHalu focuses on dialogue-level context and faithfulness subtypes, whereas HaluEval is largely QA/summarization focused
vs. FactCHD: DiaHalu includes naturally generated errors rather than KG-based constructions
vs. TruthfulQA: DiaHalu covers multi-turn dynamics and faithfulness (consistency), not just single-turn factuality
vs. PHD [not cited in paper]: PHD focuses on passage-level hallucination detection, whereas DiaHalu specifically targets multi-turn dialogue dynamics

Limitations

Reliance on proprietary models (GPT-3.5/4) for data generation may introduce specific model biases
Manual alignment process is labor-intensive and may not scale to massive dataset sizes
Detection experiments primarily use prompt-based methods rather than training specialized models
Focus is on English language dialogues only

Reproducibility

Code: https://github.com/ECNU-ICALK/DiaHalu

📊 Experiments & Results

Evaluation Setup

Evaluate LLMs and detection methods on the DiaHalu benchmark for hallucination identification

Benchmarks:

DiaHalu (Dialogue-level Hallucination Detection) [New]

Metrics:

Accuracy
F1 score
Precision
Recall
Statistical methodology: Fleiss's Kappa reported for inter-annotator agreement
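
For reference, the four detection metrics listed above reduce to simple counts over binary labels. The sketch below treats "hallucinated" as the positive class, which is an assumption about how the scores are reported; the function name and the toy labels are illustrative.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with True as the positive class."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = len(gold) - tp - fp - fn
    # Fall back to 0.0 when a denominator is empty (no positives predicted or present).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, not benchmark data: True means "dialogue contains a hallucination".
gold = [True, True, False, False, True]
pred = [True, False, False, True, True]
scores = binary_metrics(gold, pred)
```

On these toy labels there are two true positives, one false positive, one false negative, and one true negative, so accuracy is 0.6 and precision, recall, and F1 all equal 2/3.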

Key Results

Analysis of hallucination rates across different dialogue domains in the constructed benchmark.
Benchmark	Metric	Baseline	This Paper	Δ
DiaHalu	Hallucination Rate (Knowledge-grounded)	N/A	32.8%	N/A
DiaHalu	Hallucination Rate (Reasoning)	N/A	35.2%	N/A
DiaHalu	Hallucination Rate (Chit-Chat)	N/A	12.4%	N/A
DiaHalu	Hallucination Rate (Task-oriented)	N/A	19.6%	N/A
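
A per-domain rate like those in the table can be read as the share of dialogues in a domain that carry a hallucination label. That reading is an assumption, since this summary does not spell out the denominator. The sketch below uses made-up records, not the benchmark's data, and a hypothetical function name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def hallucination_rate_by_domain(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each domain to (hallucinated dialogues) / (all dialogues) in that domain."""
    totals: Counter = Counter()
    flagged: Counter = Counter()
    for domain, hallucinated in records:
        totals[domain] += 1
        if hallucinated:
            flagged[domain] += 1
    return {domain: flagged[domain] / totals[domain] for domain in totals}


# Toy (domain, label) pairs for illustration only.
toy_records = [
    ("reasoning", True),
    ("reasoning", False),
    ("chit-chat", False),
    ("chit-chat", False),
    ("chit-chat", True),
    ("chit-chat", False),
]
rates = hallucination_rate_by_domain(toy_records)
```

With these toy records the rate is 0.5 for reasoning and 0.25 for chit-chat.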

Experiment Figures

Distribution of hallucination subtypes across the four dialogue domains

Main Takeaways

Hallucinations are domain-dependent: Reasoning and Knowledge-grounded tasks trigger significantly more hallucinations than Chit-Chat.
Faithfulness issues (Incoherence, Irrelevance) are pervasive in Chit-Chat and Task-oriented dialogues, challenging the assumption that hallucination is purely a factuality problem.
Existing detection methods (such as simple prompting or uncertainty metrics) struggle with the subtle, context-dependent hallucinations in DiaHalu, which suggests that the benchmark is challenging.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination phenomena
Familiarity with dialogue systems (task-oriented, chit-chat, knowledge-grounded)
Basic knowledge of hallucination subtypes (factuality vs. faithfulness)

Key Terms

Faithfulness Hallucination: Errors where the model's output contradicts the input instructions, context, or its own previous statements (internal consistency)

Factuality Hallucination: Errors where the model's output contradicts established real-world facts

Overreliance: A type of hallucination where the model excessively trusts user input or context, often agreeing with false premises or answering unanswerable questions

Knowledge-grounded Dialogue: Conversations focused on exchanging information or discussing specific knowledge topics

Task-oriented Dialogue: Conversations aimed at completing a specific user goal, like booking a ticket or finding a restaurant

Fleiss's Kappa: A statistical measure for assessing the reliability of agreement between a fixed number of raters
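
To make the agreement statistic concrete, here is a minimal sketch of the standard Fleiss's Kappa computation from a table of per-item category counts. The toy ratings are invented for illustration; the summary does not state how many annotators or categories were used for the reported 0.8842.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs per item, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy data: 4 dialogues, 3 raters, categories = [hallucinated, not hallucinated].
toy_counts = [[3, 0], [0, 3], [3, 0], [2, 1]]
kappa = fleiss_kappa(toy_counts)
```

For the toy table the observed agreement is 5/6 and the chance agreement is 5/9, giving a kappa of 0.625. Values near 1 indicate agreement well above chance.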