Detecting Hallucinations in Authentic LLM-Human Interactions

📝 Paper Summary

Hallucination detection benchmark
Authentic user-LLM interaction analysis

AuthenHallu is a hallucination detection benchmark built entirely from naturally occurring LLM-human dialogues, revealing that real-world hallucinations differ significantly from those in artificially induced datasets.

Core Problem

Existing hallucination benchmarks rely on deliberately induced or simulated hallucinations, which fail to capture the complex distribution of genuine user queries and natural model errors found in real-world usage.

Why it matters:

Artificial benchmarks (like HaluEval) force models to hallucinate, creating data distributions that deviate from how models actually behave in deployment
Simulated benchmarks (like FELM) use simplified queries that lack the diversity and complexity of real human intent
Trustworthy deployment requires evaluating detection systems on ecologically valid data where hallucinations emerge organically

Concrete Example: In induced benchmarks, a model might be explicitly told to 'write a plausible but incorrect answer.' In AuthenHallu, a user poses a math problem and the model tries to solve it but fails (the benchmark reports a 60% hallucination rate for math). The error reflects a capability gap rather than compliance with an instruction to err.

Key Novelty

Ecologically Valid Hallucination Benchmarking

Constructs the first hallucination benchmark derived entirely from the LMSYS-Chat-1M log of one million real-world conversations, rather than using synthetic prompts
Employs a rigorous filtering and clustering pipeline to select 400 representative dialogues (800 pairs) covering diverse topics like Math, Coding, and Roleplay
Provides granular human annotation for three hallucination types (Fact-conflicting, Input-conflicting, Context-conflicting) on authentic data

Architecture

The construction pipeline of the AuthenHallu benchmark.

Evaluation Highlights

31.4% of authentic query-response pairs contain hallucinations, with fact-conflicting hallucinations being the most prevalent (62.5% of errors)
Hallucination rates vary drastically by topic: 'Math & Number Problems' and 'Dates, Time & Calendar' are both listed with the highest rate, 60.0%
Vanilla LLMs (evaluated as detectors) fall short on authentic data and struggle to identify these naturally occurring errors

Breakthrough Assessment

8/10

A significant contribution that shifts the evaluation paradigm from synthetic or induced hallucinations to authentic, in-the-wild data. Essential for realistic assessment, though the dataset (800 pairs) is small compared to synthetic benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Binary classification and multi-class categorization of hallucinations in dialogue

Inputs: A dialogue history containing user query Q and LLM response R

Outputs: Label: {Hallucination, No Hallucination} and Category: {Input-conflicting, Context-conflicting, Fact-conflicting}
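A minimal sketch of how one labeled query-response pair could be represented in code. The class and field names are hypothetical, and the rule that a hallucinated pair carries exactly one category is an assumption made for illustration; the released data may use a different schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class HallucinationCategory(Enum):
    """The three categories named in the problem definition."""
    INPUT_CONFLICTING = "input-conflicting"
    CONTEXT_CONFLICTING = "context-conflicting"
    FACT_CONFLICTING = "fact-conflicting"


@dataclass
class AnnotatedPair:
    """One query-response pair plus its annotation (hypothetical schema)."""
    query: str
    response: str
    history: List[str] = field(default_factory=list)  # earlier turns, if any
    is_hallucination: bool = False
    category: Optional[HallucinationCategory] = None

    def __post_init__(self) -> None:
        # Assumption: a category is present if and only if the pair is flagged.
        if self.is_hallucination and self.category is None:
            raise ValueError("a hallucinated pair needs a category")
        if not self.is_hallucination and self.category is not None:
            raise ValueError("a clean pair must not carry a category")


# Toy example: 17 * 23 is 391, so the response contradicts world knowledge.
example = AnnotatedPair(
    query="What is 17 * 23?",
    response="17 * 23 = 381",
    is_hallucination=True,
    category=HallucinationCategory.FACT_CONFLICTING,
)
```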

Pipeline Flow

Data Source (LMSYS-Chat-1M)
Filtering (Safety, Length, Language)
Clustering (Sentence Transformer + K-Means)
Selection (Proportional Sampling)
Annotation (Human Expert Labeling)
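The selection step can be illustrated with a small sketch of proportional sampling: each cluster receives a share of the sample that matches its share of the filtered pool. The summary does not say how fractional quotas are rounded, so the largest-remainder rounding below is an assumption, and the function names and cluster names are hypothetical.

```python
import random
from typing import Dict, List


def proportional_quota(cluster_sizes: Dict[str, int], total: int) -> Dict[str, int]:
    """Split `total` slots across clusters in proportion to their sizes."""
    population = sum(cluster_sizes.values())
    if population == 0 or total > population:
        raise ValueError("total must not exceed the number of available items")
    exact = {name: total * size / population for name, size in cluster_sizes.items()}
    quota = {name: int(share) for name, share in exact.items()}  # round down first
    leftover = total - sum(quota.values())
    # Hand the remaining slots to the clusters with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - quota[name], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    return quota


def sample_dialogues(clusters: Dict[str, List[str]], total: int, seed: int = 0) -> List[str]:
    """Draw `total` dialogue ids, respecting each cluster's proportional quota."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    quota = proportional_quota({name: len(ids) for name, ids in clusters.items()}, total)
    picked: List[str] = []
    for name, ids in clusters.items():
        picked.extend(rng.sample(ids, quota[name]))
    return picked
```

For instance, `proportional_quota({"math": 50, "coding": 30, "roleplay": 20}, 10)` allocates 5, 3, and 2 slots respectively.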

System Modules

Dialogue Filter (Data Processing)

Clean raw logs to ensure quality and manageability

Model or implementation: Rule-based + OpenAI Moderation API

Semantic Clusterer (Data Processing)

Group queries by intent to ensure benchmark diversity

Model or implementation: all-mpnet-base-v2 (Encoder) + K-Means
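To make the clustering step concrete, here is a self-contained sketch of K-Means (Lloyd's algorithm) in plain Python. It is not the authors' code: the toy 2-D points stand in for the all-mpnet-base-v2 sentence embeddings, and a real pipeline would more likely encode queries with the sentence-transformers library and call a library K-Means. The function names, iteration cap, and seed are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points: List[Vector], k: int, iterations: int = 50,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = [-1] * len(points)
    for _ in range(iterations):
        new_labels = [_nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments are stable, so we have converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy "embeddings": two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

Grouping queries this way before sampling is what lets the benchmark cover many intents instead of over-representing the most common ones.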

Annotator

Label hallucinations and categories

Model or implementation: Human Experts

Novel Architectural Elements

Pipeline designed specifically for extracting representative samples from wild chat logs rather than synthesizing them
Two-stage clustering approach to handle multi-turn dialogue diversity

Comparison to Prior Work

vs. HaluEval: AuthenHallu uses naturally occurring errors, whereas HaluEval forces errors (induced)
vs. FELM: AuthenHallu uses real user queries from chat logs, whereas FELM uses collected/synthetic queries that may lack conversational complexity
vs. PHD: AuthenHallu covers diverse domains (Math, Roleplay), whereas PHD focuses on factual entities

Limitations

Scale is limited to 800 query-response pairs due to the high cost of manual annotation
Focused on English dialogues only
Restricted to dialogues with exactly two turns to manage context complexity
Subjectivity in human annotation of hallucinations (moderate IAA of 0.591)

Reproducibility

Code: https://github.com/TAI-HAMBURG/AuthenHallu

Data and code are publicly available in the repository above. The source dataset (LMSYS-Chat-1M) is also public, and the paper describes the annotation guidelines and process.

📊 Experiments & Results

Evaluation Setup

Statistical analysis of hallucination distribution and evaluation of vanilla LLMs as detectors

Benchmarks:

AuthenHallu (Hallucination Detection) [New]

Metrics:

Hallucination Rate (%)
Distribution of Hallucination Categories
Statistical methodology: Not explicitly reported in the paper
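The two metrics listed above are simple ratios, and a short sketch may make the definitions unambiguous. The record layout (a topic string plus a category string, or `None` for a clean pair) and the toy data are hypothetical; they are not taken from the released dataset.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# (topic, hallucination category or None when the pair is clean)
Record = Tuple[str, Optional[str]]


def hallucination_rate(records: List[Record]) -> float:
    """Percentage of pairs that contain a hallucination."""
    if not records:
        raise ValueError("no records to score")
    flagged = sum(1 for _, category in records if category is not None)
    return 100.0 * flagged / len(records)


def rate_by_topic(records: List[Record]) -> Dict[str, float]:
    """Hallucination rate (%) computed separately for each topic."""
    totals = Counter(topic for topic, _ in records)
    flagged = Counter(topic for topic, category in records if category is not None)
    return {topic: 100.0 * flagged[topic] / totals[topic] for topic in totals}


def category_distribution(records: List[Record]) -> Dict[str, float]:
    """Share (%) of each category among the hallucinated pairs only."""
    counts = Counter(category for _, category in records if category is not None)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()} if total else {}


# Toy data for illustration only.
toy_records: List[Record] = [
    ("math", "fact-conflicting"),
    ("math", None),
    ("coding", "input-conflicting"),
    ("coding", None),
    ("coding", None),
]
```

Note that the category distribution is normalized over hallucinated pairs, not over all pairs, which matches how the summary reports fact-conflicting hallucinations as a share of errors.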

Key Results

Statistical analysis of the AuthenHallu benchmark reveals high rates of naturally occurring hallucinations.
Benchmark	Metric	Baseline	This Paper	Δ
AuthenHallu	Hallucination Rate (Overall)	Not reported in the paper	31.4%	Not reported in the paper
AuthenHallu	Math & Number Problems Hallucination Rate	Not reported in the paper	60.0%	Not reported in the paper
AuthenHallu	Dates, Time & Calendar Information Hallucination Rate	Not reported in the paper	60.0%	Not reported in the paper
AuthenHallu	Proportion of Fact-conflicting Hallucinations	Not reported in the paper	62.5%	Not reported in the paper

Experiment Figures

Bar chart displaying hallucination rates (%) across the top 10 topics.

Main Takeaways

Real-world hallucinations are frequent (31.4%) and heavily skewed towards factual errors (62.5%) rather than context or input conflicts.
Topic analysis shows LLMs struggle most with quantitative reasoning (Math, Time), contradicting the assumption that hallucinations are purely 'creative' errors.
Vanilla LLMs are currently insufficient as reliable detectors for these authentic errors, highlighting a gap between model capabilities and self-correction needs.
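When an LLM is used as a detector, its binary predictions can be scored against the human labels. This summary does not state which detection metrics the paper reports, so the sketch below assumes the common choice of accuracy plus precision, recall, and F1 on the hallucination class; the function name is hypothetical.

```python
from typing import Dict, Sequence


def detection_scores(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Score binary hallucination predictions (True = hallucination)."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = len(gold) - tp - fp - fn
    # Guard against empty denominators, e.g. a detector that never flags anything.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because roughly a third of the pairs are hallucinated, accuracy alone can look acceptable for a detector that rarely flags anything, which is why recall on the hallucination class is worth reporting alongside it.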

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM Hallucination (factuality vs. faithfulness)
Familiarity with dataset construction (clustering, sampling)
Knowledge of existing benchmarks (HaluEval, FELM)

Key Terms

AuthenHallu: The proposed benchmark dataset constructed from real-world LLM-human dialogues

LMSYS-Chat-1M: A large-scale dataset of 1 million real-world conversations between humans and various LLMs

Hallucination: LLM outputs that are incorrect or inconsistent with the context or user input

Deliberately Induced Generation: A data creation strategy where models are explicitly prompted to generate incorrect information (e.g., HaluEval)

Simulated Interactive Generation: A strategy where queries are collected/crafted and responses generated, but interactions are not from real users (e.g., FELM)

Fact-conflicting: A hallucination category where the output contradicts established world knowledge

Input-conflicting: A hallucination category where the output contradicts the user's explicit input prompt

Context-conflicting: A hallucination category where the output contradicts previous turns in the dialogue history

Vanilla LLM: Using a standard, off-the-shelf Large Language Model without additional fine-tuning or external tools

IAA (Inter-Annotator Agreement): A statistical measure (such as Fleiss' kappa) of how consistently different human annotators assign labels
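As an illustration of how such an agreement score is computed, the sketch below implements Fleiss' kappa from a table of per-item label counts. It assumes every item is rated by the same number of annotators; the function name is hypothetical, and this summary does not confirm that Fleiss' kappa is the exact statistic behind the reported 0.591.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is how many annotators gave item i label j."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Observed agreement: fraction of annotator pairs that agree, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement: based on how often each label is used overall.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(share * share for share in shares)
    if expected == 1.0:
        raise ValueError("kappa is undefined when every rating uses the same label")
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.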