Black-Box Hallucination Detection via Consistency Under the Uncertain Expression

📝 Paper Summary

Hallucination Detection, Uncertainty Estimation, Black-Box Methods

The paper proposes a hallucination detection metric that measures the consistency between a standard LLM response and a response generated with a prompt explicitly expressing uncertainty, finding that factual answers remain consistent while hallucinations shift.

Core Problem

LLMs frequently generate non-factual 'hallucinated' responses, but existing detection methods often require restricted internal states (like token probabilities) or expensive external resources (like Wikipedia retrieval).

Why it matters:

Real-world LLM deployments often expose only black-box APIs, making white-box detection methods using internal probabilities unusable.
External knowledge bases may lack coverage for long-tail queries or specific domains.
Existing black-box methods (e.g., SelfCheckGPT) are computationally expensive, requiring ~10 sampled responses to estimate consistency.

Concrete Example: When asked 'How many acts is the ballet Rita Sangalli premiered in...?', a model might confidently hallucinate an answer. The proposed method re-asks the question with the answer prefix 'I am not sure but it could be'; if the original answer was a hallucination, the model tends to change its answer, and that inconsistency flags the response.

Key Novelty

Consistency Under Uncertain Expression (Black-Box Metric)

Leverages the observation that LLMs are 'consistent' when factual but 'inconsistent' when hallucinating if forced to express uncertainty.
Uses a single additional inference step with a prompt like 'Answer: I am not sure but it could be' to trigger this divergence for non-factual answers.
Avoids the high cost of multiple sampling (e.g., 10+ generations) required by previous black-box consistency checks.
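As a minimal sketch of the prompt perturbation described above, the two prompts can be built as follows. Only the prefix wordings ('I am not sure but it could be', 'It must be') come from this summary; the `Question:`/`Answer:` template and the `build_prompt` helper are assumptions made for illustration, not the paper's exact format.

```python
# Prefix wordings quoted in this summary; everything else here is hypothetical.
UNCERTAIN_PREFIX = "I am not sure but it could be"
CERTAIN_PREFIX = "It must be"


def build_prompt(question: str, prefix: str = "") -> str:
    """Build a closed-book QA prompt.

    With an empty prefix this is the standard prompt. With a non-empty
    prefix, the text is placed right after 'Answer:' so the model has to
    continue a sentence that already expresses (un)certainty.
    """
    prompt = f"Question: {question}\nAnswer:"
    if prefix:
        prompt += f" {prefix}"
    return prompt


standard_prompt = build_prompt("Who wrote Hamlet?")
perturbed_prompt = build_prompt("Who wrote Hamlet?", UNCERTAIN_PREFIX)
```

The perturbed prompt differs from the standard one only in its trailing prefix, so any change in the answer can be attributed to the injected expression.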

Architecture

Conceptual workflow of the proposed detection method (text-based reconstruction)

Evaluation Highlights

Factual responses show high consistency (~87-90%) regardless of the prompt expression used.
Non-factual responses show significantly lower consistency (~32-48%) when prompted with uncertainty or certainty expressions.
The method effectively discriminates factuality using just two responses (original + perturbed), whereas baselines often need 10+ samples.
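The consistency percentages above are group-wise rates: the share of answer pairs judged consistent within the factual group and within the non-factual group. A sketch of that bookkeeping is given here; the `consistency_by_group` helper and the toy records are hypothetical and are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def consistency_by_group(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the percentage of consistent pairs per factuality group.

    Each record is (original_answer_is_factual, answers_are_consistent).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [consistent, total]
    for is_factual, is_consistent in records:
        group = "factual" if is_factual else "non_factual"
        counts[group][0] += int(is_consistent)
        counts[group][1] += 1
    return {g: 100.0 * c / n for g, (c, n) in counts.items()}


# Toy annotations (made up): 4 factual answers, 2 non-factual answers.
toy_records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False),
]
rates = consistency_by_group(toy_records)
```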

Breakthrough Assessment

7/10

Simple, effective, and efficient. It significantly reduces the compute cost of black-box hallucination detection (2 calls vs 10+) while offering a novel prompt-based mechanism.

⚙️ Technical Details

Problem Definition

Setting: Closed-book Question Answering (QA) where the goal is to classify a generated response as Factual or Non-Factual.

Inputs: A question q and an initial generated answer a_1 from an LLM.

Outputs: A binary factuality prediction based on consistency with a second answer a_2.

Pipeline Flow

Standard Generation (Get Reference Response)
Perturbed Generation (Get Response with Uncertain Prompt)
Consistency Check (Compare Responses)
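The three steps above can be sketched end to end as below. This is an illustration under stated assumptions, not the authors' code: `generate` stands for any text-in, text-out call to a black-box LLM, `is_consistent` stands for whatever consistency judgment is available (human annotators in the paper's evaluation), and `toy_generate` is a canned stand-in so the example runs without an API.

```python
from typing import Callable, Dict


def detect_hallucination(
    question: str,
    generate: Callable[[str], str],
    is_consistent: Callable[[str, str], bool],
    prefix: str = "I am not sure but it could be",
) -> Dict[str, object]:
    """Run the two-call check: standard answer, perturbed answer, comparison."""
    base_prompt = f"Question: {question}\nAnswer:"  # assumed template
    a1 = generate(base_prompt).strip()               # step 1: reference response
    a2 = generate(f"{base_prompt} {prefix}").strip() # step 2: perturbed response
    consistent = is_consistent(a1, a2)               # step 3: consistency check
    # Consistent pairs are predicted factual, inconsistent pairs non-factual.
    return {"a1": a1, "a2": a2, "predicted_factual": consistent}


def toy_generate(prompt: str) -> str:
    """Canned stand-in for an LLM: stable on one question, unstable otherwise."""
    if "capital of France" in prompt:
        return " Paris"
    return " two" if "I am not sure" in prompt else " three"


def exact_match(a: str, b: str) -> bool:
    return a.lower() == b.lower()


stable = detect_hallucination("What is the capital of France?", toy_generate, exact_match)
unstable = detect_hallucination("How many acts does the ballet have?", toy_generate, exact_match)
```

Passing the model and the checker in as callables keeps the sketch independent of any particular API client, which matches the black-box setting: only prompt text goes in and only response text comes out.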

System Modules

Standard Generator (Generation)

Generate the baseline response to be evaluated.

Model or implementation: text-davinci-003

Perturbed Generator (Generation)

Generate a second response using a prefix that expresses (un)certainty.

Model or implementation: text-davinci-003

Consistency Checker

Determine if a_1 and a_2 are semantically consistent.

Model or implementation: Human annotators (in paper evaluation)
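The paper's evaluation relies on human judgment for this step. For automated use, a rough lexical stand-in could look like the sketch below. It is an assumption introduced here, not the paper's procedure: answers are normalized (lowercased, punctuation and articles removed) and counted as consistent when one answer's tokens are contained in the other's. A semantic model such as an NLI classifier would be a stronger replacement.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_consistent(a1: str, a2: str) -> bool:
    """Crude lexical proxy: one normalized answer's tokens contain the other's."""
    t1, t2 = set(normalize(a1).split()), set(normalize(a2).split())
    if not t1 or not t2:
        return False  # an empty answer cannot confirm anything
    return t1 <= t2 or t2 <= t1
```

This proxy misses paraphrases ('NYC' vs. 'New York City') and can be fooled by negation, so it should be read as a placeholder for the human check rather than an equivalent.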

Novel Architectural Elements

Use of 'uncertainty injection' via prompt prefixes (e.g., 'I am not sure but...') as a discriminator for factuality.

Modeling

Base Model: GPT-3 (text-davinci-003)

📊 Experiments & Results

Evaluation Setup

Zero-shot closed-book QA on HotpotQA and NQ-open datasets.

Benchmarks:

HotpotQA (Multi-hop QA)
NQ-open (Open-domain QA)

Metrics:

Accuracy (Factuality of response)
Consistency (%)
Log Probability Ratio (log p2/p1)
Entropy

Statistical methodology: Not explicitly reported in the paper
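The log probability ratio and entropy metrics listed above can be illustrated with a short sketch. The summary does not define p1 and p2, so the reading used here is an assumption: p1 is the probability the model assigns to its answer under the standard prompt and p2 is the probability under the perturbed prompt. Entropy is the usual Shannon entropy of a probability distribution, in nats.

```python
import math
from typing import Sequence


def log_prob_ratio(logp1: float, logp2: float) -> float:
    """log(p2 / p1), computed from log-probabilities for numerical stability.

    Assumed meaning: logp1 is the answer's log-probability under the standard
    prompt, logp2 under the perturbed prompt. Negative values mean the answer
    became less likely after the perturbation.
    """
    return logp2 - logp1


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of a distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


ratio = log_prob_ratio(math.log(0.8), math.log(0.4))  # the answer is half as likely
uniform_entropy = entropy([0.5, 0.5])
```

Both quantities need token probabilities, so unlike the consistency check they are not available from a text-only API.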

Key Results

Analysis of consistency shows that factual answers remain stable under prompt perturbation, while non-factual answers fluctuate significantly.

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Consistency % (Factual Group)	Not reported in the paper	87.53	Not reported in the paper
HotpotQA	Consistency % (Non-Factual Group)	Not reported in the paper	45.19	Not reported in the paper
NQ-open	Consistency % (Factual Group)	Not reported in the paper	90.3	Not reported in the paper
NQ-open	Consistency % (Non-Factual Group)	Not reported in the paper	43.28	Not reported in the paper
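As a worked piece of arithmetic on the table above, the gap between the factual and non-factual groups can be computed directly from the four reported values. The gap is a derived quantity introduced here for illustration; it is not the Δ column, which the summary marks as not reported.

```python
# Consistency values (%) copied from the Key Results table.
reported = {
    "HotpotQA": {"factual": 87.53, "non_factual": 45.19},
    "NQ-open": {"factual": 90.3, "non_factual": 43.28},
}


def consistency_gap(groups: dict) -> float:
    """Factual minus non-factual consistency, in percentage points."""
    return round(groups["factual"] - groups["non_factual"], 2)


gaps = {name: consistency_gap(groups) for name, groups in reported.items()}
# gaps == {"HotpotQA": 42.34, "NQ-open": 47.02}
```

On both benchmarks the factual group is more than 40 percentage points more consistent than the non-factual group, which is the separation the detection metric relies on.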

Experiment Figures

Distribution of log probability ratios for factual vs. non-factual groups

AUROC curves for detecting non-factual classes among the consistent group
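For readers unfamiliar with the AUROC referenced in the figure above, here is a small dependency-free sketch. It uses the pairwise definition: the probability that a randomly chosen positive example (here, a non-factual answer) receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels are toy values, not results from the paper, and which score the paper feeds into its AUROC is not stated in this summary.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via pairwise comparison; label 1 marks the class to detect."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: higher score should indicate a non-factual answer.
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 3 of 4 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of examples, which is acceptable for illustration; a rank-based implementation would be preferable on large evaluation sets.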

Main Takeaways

Factual responses are robust to prompt perturbations: they remain consistent ~87-90% of the time even when the model is prompted with uncertainty.
Non-factual (hallucinated) responses are fragile: consistency drops to ~32-48% when prompt perturbations are applied.
Surprisingly, expressions of uncertainty ('I am not sure...') and certainty ('It must be...') yield similar discrimination power for consistency.
The log probability ratio drops more for factual samples than non-factual ones in some settings, but consistency is the primary discriminator.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and prompting.
Familiarity with the concept of hallucination in NLG.
Understanding of entropy and probability in language generation.

Key Terms

Hallucination: Generations by an LLM that are non-factual or ungrounded in reality.

Black-Box: Methods that only utilize the input and output text of a model, without access to weights or internal probabilities.

Consistency: The degree to which multiple responses from an LLM to the same (or similar) prompt agree semantically.

NLI: Natural Language Inference, the task of determining whether one sentence entails, contradicts, or is neutral towards another.

Entropy: A measure of uncertainty or randomness in the model's output distribution.

SelfCheckGPT: A black-box baseline method that samples multiple responses to the same prompt and checks them for mutual consistency; it serves here as the multi-sample point of comparison.

Greedy decoding: A decoding strategy where the model always picks the highest probability token at each step.