Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity

📝 Paper Summary

Hallucination evaluation · Prompt sensitivity

LLMs exhibit severe inconsistency (prompt multiplicity): prompt variations change the answers to individual questions even though overall accuracy stays stable, which reveals that many hallucinations are random guesses rather than persistent misconceptions.

Core Problem

Current hallucination benchmarks focus only on correctness (accuracy) and overlook consistency, masking the difference between random guessing and persistent misinformation.

Why it matters:

Random guessing (prompt-sensitive errors) erodes trust and causes confusion but can be managed with uncertainty estimation
Persistent errors (prompt-agnostic errors) spread misinformation and require external fact-checking or data filtering
Existing benchmarks treat both error types identically, leading to a severe misunderstanding of the true harms and appropriate mitigation strategies

Concrete Example: On Med-HALT, Llama3-8B selects 'Tetracycline' (wrong) for one prompt but 'Ibuprofen' (wrong) or 'Amoxicillin' (wrong) once the MCQ options are shuffled. Llama3-8B-Instruct, by contrast, consistently selects 'Tetracycline'. Benchmarks score both models as 'incorrect', missing that the first case is random noise and the second is a dangerous, persistent medical error.

Key Novelty

Prompt Multiplicity Framework for Hallucinations

Formalize consistency using 'prompt multiplicity'—the phenomenon where models with similar accuracy give conflicting predictions for individual questions based on prompt structure
Decompose 'hallucinations' into 'randomness' (prompt-sensitive) and 'prompt-agnostic errors' (persistent), and 'factuality' into 'prompt-agnostic factuality' and 'prompt-sensitive correct guesses'
Demonstrate that detection methods align with consistency rather than correctness, and that RAG introduces new inconsistency via retrieval sensitivity

Architecture

A taxonomy mapping traditional 'Factuality' and 'Hallucinations' to 'Prompt-agnostic Factuality', 'Randomness', and 'Prompt-agnostic Errors' based on prompt multiplicity.

Evaluation Highlights

Over 50% ambiguity (inconsistency) observed in benchmarks like Med-HALT for models like Llama2-13B-Chat, despite low accuracy variance (<0.5%)
Detection methods (Perplexity, SelfCheck) show high statistical significance (p < .001) when distinguishing consistent vs. inconsistent answers, but fail to distinguish correct vs. incorrect answers (p > .05 for Perplexity on Wiki-FACTOR)
RAG mitigation reduces overall errors but introduces high retrieval ambiguity: >90% of random errors in FEVER with RAG stem from the retriever selecting different documents for different prompts
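
To make the detection finding more concrete, the sketch below shows how a perplexity score is typically computed from per-token log-probabilities. It is a minimal illustration, not the paper's implementation: the function name `perplexity` and the example log-probabilities are assumptions made here. One plausible reading of the result above is that a model repeating the same wrong answer assigns it high probability, so its perplexity stays low and the error goes unflagged.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a generation from its per-token natural-log probabilities.

    Standard definition: exp of the negative mean token log-probability.
    Lower values mean the model was more confident in what it generated.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


# Hypothetical log-probabilities, chosen only for illustration.
confident_answer = [math.log(0.9)] * 4   # model is sure, whether right or wrong
hesitant_answer = [math.log(0.3)] * 4    # model is unsure

confident_ppl = perplexity(confident_answer)
hesitant_ppl = perplexity(hesitant_answer)
```

The score depends only on the model's confidence, never on a gold label, which is consistent with it tracking consistency rather than correctness.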

Breakthrough Assessment

8/10

Significantly reframes the understanding of hallucinations by introducing consistency as a critical dimension. Exposes major flaws in current benchmarking and detection assumptions.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Large Language Models (LLMs) on multiple-choice question (MCQ) factuality benchmarks under prompt variations

Inputs: A set of factual questions X, a set of equivalent prompt structures P (e.g., shuffled options, paraphrasing), and an LLM G

Outputs: Classification of each generation as prompt-agnostic factuality, prompt-agnostic error, or randomness based on self-consistency

Pipeline Flow

Input Question x_k
Apply Prompt Variations P (shuffle options, shuffle shots, paraphrase)
Model Generation G(p_i(x_k)) for each variation
Compute Ambiguity & Self-Consistency
Categorize into Prompt-Agnostic (Factuality/Error) or Randomness
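
The flow above can be sketched for a single question as follows. This is a hedged illustration under stated assumptions: self-consistency is computed as the chance that two distinct, randomly chosen prompt variants agree (sampling without replacement is an assumption), the threshold of 0.8 is a hypothetical value rather than the paper's, and the function names are invented here.

```python
from collections import Counter


def self_consistency(answers):
    """Probability that two distinct, randomly chosen prompt variants agree.

    `answers` holds the model's chosen option for one question, one entry
    per prompt variant. Assumes pairs are drawn without replacement.
    """
    n = len(answers)
    if n < 2:
        raise ValueError("need at least two prompt variants")
    counts = Counter(answers)
    agreeing_pairs = sum(c * (c - 1) for c in counts.values())
    return agreeing_pairs / (n * (n - 1))


def categorize(answers, gold, threshold=0.8):
    """Label every generation for one question.

    threshold is a hypothetical cut-off; the summary does not state one.
    """
    sc = self_consistency(answers)
    labels = []
    for answer in answers:
        if sc < threshold:
            labels.append("randomness")
        elif answer == gold:
            labels.append("prompt-agnostic factuality")
        else:
            labels.append("prompt-agnostic error")
    return sc, labels


# Abstract option letters; "A" is taken to be the correct option.
unstable_sc, unstable_labels = categorize(["B", "C", "D", "B"], gold="A")
stable_sc, stable_labels = categorize(["B", "B", "B", "B"], gold="A")
```

Both answer sets are wrong on every prompt, yet the first is labeled randomness and the second a prompt-agnostic error, mirroring the base versus instruct contrast in the Med-HALT example.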

System Modules

Prompt Variator (Evaluation Setup)

Generate equivalent prompt structures for the same question

Model or implementation: Rules (shuffling) or T5-Paraphraser
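
A rule-based variator of the shuffling kind could look like the sketch below. It is an assumption-laden illustration, not the paper's code: the function name `shuffle_options`, the lettered prompt layout, and the seeding scheme are all invented here. The key detail is that the gold label must be remapped after shuffling so that scoring stays valid.

```python
import random

LETTERS = "ABCDEFGH"  # supports up to eight options in this sketch


def shuffle_options(question, options, gold_index, seed):
    """Build one prompt variant by shuffling the MCQ options.

    Returns the prompt text, the letter of the correct option after
    shuffling, and the shuffled option list.
    """
    if len(options) > len(LETTERS):
        raise ValueError("too many options for this sketch")
    rng = random.Random(seed)            # local RNG keeps variants reproducible
    order = list(range(len(options)))    # order[k] = original index now at slot k
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    gold_letter = LETTERS[order.index(gold_index)]
    lines = [question] + [f"{LETTERS[k]}. {opt}" for k, opt in enumerate(shuffled)]
    return "\n".join(lines), gold_letter, shuffled


# Hypothetical question and options, used only to exercise the function.
variants = [
    shuffle_options("Which option is correct?", ["w", "x", "y", "z"], 2, seed)
    for seed in range(3)
]
```

Each seed yields an equivalent prompt structure for the same question; the paraphrasing variant mentioned above would need a separate model and is not covered here.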

Target LLM (Evaluation Setup)

Generate answers for each prompt variation

Model or implementation: Various (Llama-2, Llama-3, GPT-J, etc.)

Consistency Analyzer

Calculate multiplicity metrics and categorize harms

Model or implementation: Statistical formulas (Definitions 1-5)

Novel Architectural Elements

Integration of multiplicity metrics (ambiguity, self-consistency) directly into the hallucination evaluation pipeline to distinguish error types

Modeling

Base Model: Llama-2 (7B/13B), Llama-3 (8B), GPT-J-6B, GPT-NeoX-20B, Pythia (2.8B-12B), Bloom (3B/7.1B), OPT (6.7B-30B)

Comparison to Prior Work

vs. Standard Accuracy: Adds consistency dimension to distinguish random guesses from persistent errors
vs. Uncertainty Estimation: Frames inconsistency as a frequentist property (multiplicity) rather than an information-theoretic one (entropy), directly linking it to real-world harms
vs. Predictive Multiplicity: Adapts multiplicity from 'across models' to 'across prompts within one model'

Limitations

Scope limited to multiple-choice question (MCQ) benchmarks; free-form generation is harder to evaluate for multiplicity
Reliance on baseline detection techniques (Perplexity, Entropy) rather than SOTA methods
Automated paraphrasing for Wiki-FACTOR was not explicitly validated for semantic preservation (though a specialized model was used)

📊 Experiments & Results

Evaluation Setup

Zero-shot or few-shot MCQ answering across 8 benchmarks

Benchmarks:

Med-HALT: medical-domain hallucination test (MCQ)
TruthfulQA: misconception and myth detection (MCQ)
Wiki-FACTOR: factual knowledge verification

Metrics:

Accuracy (per-question and average)
Ambiguity (percentage of questions with inconsistent answers)
Self-consistency score
Prompt-agnostic Factuality/Error rates
Randomness rate
Statistical methodology: Wilcoxon signed-rank test for comparing detection scores across categories
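
The benchmark-level metrics can be illustrated with a toy example in which accuracy is identical across prompts while half the questions receive conflicting answers. This is a sketch under assumptions: the function name `benchmark_metrics` is invented here, the data are made up, and a question counts as prompt-agnostic only when every prompt variant yields the same answer (a strict simplification of a self-consistency threshold).

```python
def benchmark_metrics(answers, gold):
    """Aggregate metrics over a benchmark.

    answers[q][p] is the option chosen for question q under prompt variant p;
    gold[q] is the correct option for question q.
    """
    n_q = len(answers)
    n_p = len(answers[0])
    per_prompt_accuracy = [
        sum(answers[q][p] == gold[q] for q in range(n_q)) / n_q
        for p in range(n_p)
    ]
    # Simplifying assumption: "consistent" means all variants agree.
    unanimous = [len(set(row)) == 1 for row in answers]
    ambiguity = sum(not u for u in unanimous) / n_q
    factuality = sum(u and row[0] == g for u, row, g in zip(unanimous, answers, gold)) / n_q
    error = sum(u and row[0] != g for u, row, g in zip(unanimous, answers, gold)) / n_q
    return {
        "per_prompt_accuracy": per_prompt_accuracy,
        "ambiguity": ambiguity,
        "prompt_agnostic_factuality": factuality,
        "prompt_agnostic_error": error,
        "randomness": ambiguity,  # under the unanimity assumption these coincide
    }


# Toy data: 4 questions, 2 prompt variants, correct option always "A".
toy_answers = [["A", "A"], ["B", "B"], ["A", "B"], ["B", "A"]]
toy_gold = ["A", "A", "A", "A"]
metrics = benchmark_metrics(toy_answers, toy_gold)
```

In this toy case both prompts score 50% accuracy, yet only 25% of questions are answered correctly under every prompt, which is the kind of gap between reported accuracy and prompt-agnostic factuality that the metrics are meant to expose.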

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Ambiguity analysis reveals high inconsistency despite stable accuracy across major models.
Med-HALT	Ambiguity	0	60.54	+60.54
Med-HALT	Ambiguity	0	52.26	+52.26
Decomposition of errors shows different benchmarks suffer from different types of hallucination harms.
TruthfulQA	Randomness vs Prompt-Agnostic Errors	Not reported in the paper	Not reported in the paper	Not reported in the paper
Med-HALT	Randomness vs Prompt-Agnostic Errors	Not reported in the paper	Not reported in the paper	Not reported in the paper
Hallucination detection techniques are shown to detect consistency, not correctness.
Wiki-FACTOR	p-value (Correct vs Incorrect)	0.05	0.404	+0.354
Wiki-FACTOR	p-value (Consistent vs Inconsistent)	0.05	0.000	-0.05
RAG mitigation improves factuality but shifts errors toward randomness due to retriever instability.
FEVER	Ambiguity over Retrieved Docs (for Randomness category)	Not reported in the paper	94.67	Not reported in the paper

Experiment Figures

Bar charts decomposing model outputs on TruthfulQA, Wiki-FACTOR, and Med-HALT into the three new categories (Factuality, Randomness, Errors).

Main Takeaways

Standard accuracy metrics grossly overestimate model capabilities; 'Prompt-Agnostic Factuality' (consistently correct answers) is often 15-20% lower than reported accuracy.
Benchmarks with similar accuracy can have vastly different error profiles: TruthfulQA reveals persistent myths (requires fact-checking), while Med-HALT reveals random guessing (requires uncertainty estimation).
Hallucination detection methods (Perplexity, Entropy, SelfCheck) function as consistency detectors, not correctness detectors; they fail to flag consistently incorrect hallucinations.
RAG reduces prompt-agnostic errors but significantly increases randomness because the retriever itself is sensitive to prompt phrasing (retrieving different docs >80% of the time for unstable questions).
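
The retrieval-sensitivity finding can be checked with a measurement like the one sketched below: for a chosen set of questions (for example, those in the randomness category), count how often the retriever returns different documents across prompt variants. The function name `retrieval_ambiguity`, the document IDs, and the order-insensitive comparison are assumptions made for illustration.

```python
def retrieval_ambiguity(retrieved, question_ids=None):
    """Share of questions whose retrieved document set changes across prompts.

    retrieved maps a question id to a list with one entry per prompt variant,
    each entry being the document ids returned for that variant.
    question_ids optionally restricts the computation to a subset.
    """
    ids = list(retrieved) if question_ids is None else list(question_ids)
    if not ids:
        raise ValueError("no questions to evaluate")
    differing = 0
    for q in ids:
        # Assumption: retrieval order is ignored, only the set of docs matters.
        distinct_sets = {frozenset(docs) for docs in retrieved[q]}
        if len(distinct_sets) > 1:
            differing += 1
    return differing / len(ids)


# Hypothetical retrieval logs for three questions and two prompt variants.
logs = {
    "q1": [["d1", "d2"], ["d2", "d1"]],  # same docs, different order
    "q2": [["d1"], ["d7"]],              # retriever switched documents
    "q3": [["d3"], ["d3"]],
}
overall = retrieval_ambiguity(logs)
random_only = retrieval_ambiguity(logs, question_ids=["q2"])
```

Restricting the computation to questions already labeled as randomness is what links unstable answers back to unstable retrieval rather than to the generator alone.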

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of LLM evaluation benchmarks (TruthfulQA, etc.)
Familiarity with the concept of Hallucinations in LLMs
Understanding of Predictive Multiplicity from fairness/ML literature

Key Terms

prompt multiplicity: The phenomenon where competing prompt structures yield similar aggregate accuracy but generate conflicting individual predictions for the same question

ambiguity: The proportion of questions in a benchmark where the model outputs different choices depending on the prompt structure

self-consistency: For a specific question, the probability of getting the same output choice from two randomly chosen prompt structures

prompt-sensitive: A generation is prompt-sensitive if its self-consistency score is below a threshold (indicating randomness)

prompt-agnostic: A generation is prompt-agnostic if its self-consistency score is above a threshold (indicating persistent behavior)

RAG: Retrieval-Augmented Generation—systems that fetch external documents to ground answers

Med-HALT: A medical domain hallucination benchmark

TruthfulQA: A benchmark designed to measure whether language models generate falsehoods mimicking human misconceptions

predictive multiplicity: A concept from ML fairness where models with equal accuracy have different individual predictions; here adapted to prompt variations