Distinguishing Ignorance from Error in LLM Hallucinations

📝 Paper Summary

Hallucination detection, Hallucination mitigation

The authors distinguish between hallucinations caused by lack of knowledge (HK-) and those occurring despite having knowledge (HK+), showing that detection and mitigation improve when treating them separately.

Core Problem

Current hallucination research often conflates two distinct failure modes: cases where the model lacks the information (ignorance) and cases where it knows the answer but still answers incorrectly (error).

Why it matters:

Different hallucination types require different solutions: ignorance requires external retrieval or abstention, while errors despite knowledge might be fixed via prompting or steering
Treating all hallucinations as a single class hinders the performance of detectors and mitigation strategies
Model-specific nuances in knowledge are often ignored by generic hallucination datasets

Concrete Example: A model might know the answer to 'Who won the 1990 World Cup?' (West Germany) and answer correctly in a neutral setting, but when prompted with a 'Snowballing' context containing prior mistakes, it hallucinates a wrong answer (HK+). This is fundamentally different from not knowing the winner at all (HK-).

Key Novelty

WACK (Wrong Answers despite Correct Knowledge) Framework

Classifies hallucinations into HK- (model lacks knowledge) and HK+ (model has knowledge but fails), using multiple sampling attempts to determine knowledge status
Intentionally induces HK+ hallucinations in knowledgeable models using progressively stronger prompt perturbations (e.g., 'Snowballing' with prior errors, 'Alice-Bob' persuasion)
Demonstrates that linear probes on internal states can distinguish between these two types of hallucinations

Architecture

Figure: The WACK data construction pipeline

Evaluation Highlights

HK+ hallucinations are prevalent, occurring in 4%–24% of high-knowledge cases across models (Llama-3, Mistral, Gemma-2) and datasets
A classifier trained to detect HK+ generalizes across different prompt settings (e.g., training on 'Snowballing' detects 'Alice-Bob' errors with AUC > 0.85)
Model-specific WACK datasets outperform generic datasets for HK+ detection, improving AUC by roughly 0.10 to 0.15 in some settings

Breakthrough Assessment

7/10

Provides a crucial conceptual distinction (HK+ vs HK-) backed by a solid automated framework for dataset creation. The finding that 'errors despite knowledge' are distinct and detectable is valuable for future mitigation strategies.

⚙️ Technical Details

Problem Definition

Setting: Closed-book Question Answering (CBQA) with short answers

Inputs: Natural language question q with a known gold answer a_g

Outputs: Generated answer a_tilde which is either correct or a hallucination

Pipeline Flow

Knowledge Assessment (Classify each question q as Low-Knowledge or High-Knowledge)
HK- Labeling (Low-Knowledge examples labeled HK-)
HK+ Induction (Apply perturbation prompts to High-Knowledge examples)
Final Dataset Compilation (Correct, HK-, HK+)
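The four steps above can be sketched as a single labeling function. This is a minimal illustration, not the authors' code: the function name `wack_label`, the string labels, and the assumption that "Correct" examples are high-knowledge questions still answered correctly under the perturbed prompt are all hypothetical choices made for this sketch.

```python
from typing import Optional


def wack_label(knowledge: str, correct_under_perturbation: bool) -> Optional[str]:
    """Assign a dataset label to one question (hypothetical helper).

    knowledge: "high" or "low", as decided by the knowledge assessment step.
    correct_under_perturbation: whether the answer generated from the
        perturbed prompt matched the gold answer.
    """
    if knowledge == "low":
        # The model does not know the answer, so a wrong answer is ignorance.
        return "HK-"
    if knowledge == "high":
        # The model knows the answer; a wrong answer here is an error
        # despite knowledge.
        return "Correct" if correct_under_perturbation else "HK+"
    # Any other status (e.g. partial knowledge) is left out of the dataset.
    return None


examples = [("high", True), ("high", False), ("low", False), ("partial", True)]
labels = [wack_label(k, c) for k, c in examples]
```

Keeping the knowledge status as an explicit input makes the key design point visible: the same wrong answer receives a different label depending on what the model was shown to know beforehand.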

System Modules

Knowledge Assessor (Data Construction)

Determine if model has parametric knowledge of the answer

Model or implementation: Target LLM (e.g., Llama-3.1-8B)

HK+ Inducer (Data Construction)

Induce hallucinations in high-knowledge examples using specific prompts

Model or implementation: Target LLM
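To make the idea of a perturbation prompt concrete, here is a sketch of how a 'Snowballing'-style context could be assembled. The template below is hypothetical: the paper's actual templates are in its Appendix and are not reproduced in this summary, and the function name, the `question:`/`answer:` layout, and the example shots are assumptions for illustration only.

```python
from typing import List, Tuple


def build_snowballing_prompt(shots: List[Tuple[str, str]], question: str) -> str:
    """Build a few-shot prompt whose earlier answers are deliberately wrong.

    shots: (question, wrong_answer) pairs placed before the target question.
    question: the target question the model is known to answer correctly
        in a neutral setting.
    """
    lines = []
    for shot_question, wrong_answer in shots:
        lines.append(f"question: {shot_question}")
        lines.append(f"answer: {wrong_answer}")
    # The target question is left open for the model to complete.
    lines.append(f"question: {question}")
    lines.append("answer:")
    return "\n".join(lines)


# Three shots with intentionally incorrect answers (illustrative only).
wrong_shots = [
    ("What is the capital of France?", "Berlin"),
    ("How many days are in a week?", "Nine"),
    ("What color is a ripe banana?", "Blue"),
]
prompt = build_snowballing_prompt(wrong_shots, "Who won the 1990 World Cup?")
```

If the model answers this prompt incorrectly after answering the same question correctly in a neutral setting, the example would count as HK+.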

Novel Architectural Elements

WACK framework pipeline for systematically constructing HK+ examples by verifying knowledge first, then attacking it with mild-to-strong perturbations

Modeling

Base Model: Evaluated on Mistral-7B-v0.3, Llama-3.1-8B, Gemma-2-9B

Training Method: Training linear probes (classifiers) on fixed model activations
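A linear probe of this kind can be sketched in a few lines of dependency-free Python. The example trains on synthetic 2-D vectors that stand in for frozen activations; the data, learning rate, epoch count, and function names are assumptions for illustration, and which layer or token position the paper probes is not stated in this summary.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 for a single feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(X, y, lr=0.1, epochs=200):
    """Fit logistic regression with full-batch gradient descent.

    X holds fixed feature vectors; only the probe weights w and bias b
    are updated, mirroring a probe trained on frozen model activations.
    """
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            err = predict_proba(w, b, x) - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic stand-in for activations: two well-separated clusters.
rng = random.Random(0)
X = [[rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)] for _ in range(50)]
X += [[rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(50)]
y = [0] * 50 + [1] * 50  # e.g. 0 = HK-, 1 = HK+

w, b = train_probe(X, y)
accuracy = sum((predict_proba(w, b, x) >= 0.5) == bool(t) for x, t in zip(X, y)) / len(X)
```

In practice a library implementation such as scikit-learn's `LogisticRegression` would likely replace the hand-written loop; the manual version is used here only to keep the sketch self-contained.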

Objective Functions:

Purpose: Train probes to distinguish hallucination types.

Formally: Logistic Regression on hidden states.
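For reference, the standard logistic regression objective can be written as follows. The notation is an assumption of this sketch rather than the paper's: h_i is the hidden-state vector extracted for example i, y_i is its binary label (for instance 1 for HK+ and 0 for HK-), and w, b are the probe parameters. Any regularization term the authors may use is omitted.

```latex
\hat{p}_i = \sigma\!\left(w^\top h_i + b\right), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

\mathcal{L}(w, b) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[\, y_i \log \hat{p}_i + (1 - y_i) \log\!\big(1 - \hat{p}_i\big) \Big]
```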

Training Data:

TriviaQA and NaturalQuestions datasets
Data are split into train and test sets for the probes (exact split sizes are not given in this summary)

Key Hyperparameters:

sampling_temperature: 0.5 (for knowledge assessment)
num_samples_knowledge_check: 5
few_shot_k: 3
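The two knowledge-check hyperparameters can be read as follows in a minimal sketch. The decision rule used here (all samples correct means high knowledge, none correct means low knowledge, anything in between is set aside) is an assumption consistent with the summary's description of the two ends of the knowledge spectrum; `sample_answer` is a hypothetical callable standing in for a call to the target LLM, and the lowercase string comparison is a simplification.

```python
from typing import Callable

NUM_SAMPLES = 5     # num_samples_knowledge_check
TEMPERATURE = 0.5   # sampling_temperature


def assess_knowledge(
    question: str,
    gold: str,
    sample_answer: Callable[[str, float], str],
    num_samples: int = NUM_SAMPLES,
    temperature: float = TEMPERATURE,
) -> str:
    """Classify a question as "high", "low", or "partial" knowledge."""
    hits = 0
    for _ in range(num_samples):
        answer = sample_answer(question, temperature)
        if answer.strip().lower() == gold.strip().lower():
            hits += 1
    if hits == num_samples:
        return "high"
    if hits == 0:
        return "low"
    return "partial"  # assumed to be excluded from the dataset


# Stub samplers standing in for the model (illustrative only).
def always_right(question: str, temperature: float) -> str:
    return "West Germany"


def always_wrong(question: str, temperature: float) -> str:
    return "Brazil"


def make_flaky() -> Callable[[str, float], str]:
    answers = iter(["West Germany", "Brazil", "West Germany", "Italy", "West Germany"])
    return lambda question, temperature: next(answers)


question = "Who won the 1990 World Cup?"
status_right = assess_knowledge(question, "West Germany", always_right)
status_wrong = assess_knowledge(question, "West Germany", always_wrong)
status_flaky = assess_knowledge(question, "West Germany", make_flaky())
```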

Compute: Not reported in the paper

Comparison to Prior Work

vs. Generic detection: WACK explicitly separates HK- and HK+, showing they have different internal representations
vs. Existing model-specific datasets: Prior work does not verify whether the model 'knows' the answer before labeling a hallucination, potentially conflating ignorance with error

Limitations

Focuses only on the two ends of the knowledge spectrum (known vs. unknown), ignoring the middle range of partial knowledge
Relies on exact match for answer verification, which might miss semantically correct but lexically different answers
Analysis restricted to short-answer CBQA; applicability to long-form generation is not tested
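The exact-match limitation is easy to see in a small example. The normalization below (lowercasing, stripping punctuation and English articles) follows a common QA evaluation convention and is an assumption of this sketch; the paper's own matching rule may differ.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


# Surface variation in case and punctuation is tolerated...
same_surface = exact_match("West Germany.", "west germany")
# ...but a semantically equivalent paraphrase is counted as wrong.
paraphrase = exact_match("Germany (FRG)", "West Germany")
```

Here `same_surface` is `True` while `paraphrase` is `False`, so a correct but differently worded answer would be mislabeled as a hallucination.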

Reproducibility

Code: https://github.com/technion-cs-nlp/hallucination-mitigation

Datasets are released in the same repository. The paper provides detailed prompt templates in the Appendix.

📊 Experiments & Results

Evaluation Setup

Closed-book QA on TriviaQA and NaturalQuestions

Benchmarks:

TriviaQA (CBQA)
NaturalQuestions (CBQA)

Metrics:

AUC (Area Under ROC Curve)
Percentage of HK+ occurrence
Statistical methodology: Not explicitly reported in the paper
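As a reminder of what the AUC metric measures, it can be computed directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The pairwise implementation below is a sketch for small inputs; the labels and scores are made up for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (O(P * N))."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy detector scores: 1 = HK+, 0 = not HK+.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

For the toy input, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.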

Key Results

Result	Benchmark	Metric	Baseline	This Paper	Δ
Prevalence of HK+ hallucinations across models and settings	TriviaQA/NQ	HK+ rate (%, upper end of the reported 4%–24% range)	0	24	+24
Classification performance distinguishing HK+ from HK- using linear probes	TriviaQA	AUC	0.50	0.85	+0.35
Impact of model-specific datasets on detection performance	TriviaQA	AUC	0.62	0.76	+0.14

Main Takeaways

HK+ hallucinations are distinct from HK-: mitigation strategies like prompting work for HK+ but not HK-
Simple linear classifiers on internal states can distinguish HK+ from HK- well above chance (AUC of 0.85 on TriviaQA, versus 0.50 for random guessing)
HK+ patterns generalize: a detector trained on 'Snowballing' errors can detect 'Alice-Bob' errors
Model-specific datasets are crucial: knowledge boundaries vary by model, so generic datasets confuse ignorance and error

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Familiarity with in-context learning and prompting
Basic knowledge of probing classifiers (linear probes on hidden states)

Key Terms

HK+: Hallucination despite Knowledge—the model outputs an incorrect answer even though it contains the correct knowledge in its parameters

HK-: Hallucination due to Knowledge deficiency—the model outputs an incorrect answer because it does not possess the required knowledge

WACK: Wrong Answers despite Correct Knowledge—the proposed automatic framework for generating model-specific datasets containing HK+ and HK- examples

CBQA: Closed-Book Question Answering—answering questions without access to external documents

Snowballing: A phenomenon where a model's prior mistakes (or mistakes in the prompt context) lead to further incorrect generations

Greedy decoding: A decoding strategy where the model always selects the token with the highest probability

Linear probe: A simple linear classifier trained on the internal activations (hidden states) of a neural network to predict a specific property

AUC: Area Under the ROC Curve—a performance metric for classification tasks, where 1.0 is perfect and 0.5 is random guessing