Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge

📝 Paper Summary

Hallucination detection, Factuality

SHINE detects hallucinations by perturbing key entities in prompts and analyzing the resulting variance in the LLM's token generation probabilities to distinguish between faithful, misaligned, and fabricated responses.

Core Problem

Current hallucination detection methods rely on external knowledge (unavailable in some settings), fine-tuning (expensive), or uncertainty metrics that fail to distinguish between lack of knowledge (fabrication) and reasoning errors (misalignment).

Why it matters:

External knowledge sources like Wikipedia may not exist for private data or specialized domains.
Simple uncertainty metrics conflate randomness with actual hallucinations, leading to false positives.
Distinguishing between 'fabrication' (no knowledge) and 'misalignment' (has knowledge but contradicts it) is crucial for accurate detection but ignored by binary classifiers.

Concrete Example: When an LLM is asked about a non-existent medical condition, it might confidently generate a fake definition (fabrication). Standard uncertainty methods might miss this if the model is confident in its next-token prediction, whereas consistency checks might fail if the model consistently hallucinates the same wrong fact.

Key Novelty

Systematic Hallucination Inspection with Noisy Entity (SHINE)

Introduces 'Hallucination Probing': a 3-way classification task (aligned, misaligned, fabricated) instead of binary detection.
Discovers 'Entity Perturbation Impact': fabricated text is insensitive to entity noise (low impact), while aligned text is sensitive (high impact). Misaligned text shows increased probability under noise.
Uses a two-stage test: first checks if the model knows the concept (Model Knowledge Test), then checks if the output matches that knowledge (Alignment Test).

Architecture

The SHINE workflow diagram, detailing the two-stage process: Model Knowledge Test followed by Alignment Test.

Evaluation Highlights

Outperforms 7 competing methods across 4 datasets (TriviaQA, SQuAD, NQ, TruthfulQA) and 4 LLMs.
Achieves state-of-the-art AUROC, e.g., 0.88 on TriviaQA with LLaMA2-13B-Chat versus 0.81 for SelfCheckGPT.
Effective without external knowledge, supervised training, or LLM fine-tuning.

Breakthrough Assessment

8/10

Significant shift from binary detection to 3-way probing using internal mechanics (perturbation) rather than external lookups. High performance gains without training make it very practical.

⚙️ Technical Details

Problem Definition

Setting: Classify an LLM-generated text G given prompt P into three categories: Aligned, Misaligned, or Fabricated.

Inputs: Prompt P, generated response G, and white-box access to the LLM (embeddings and logits)

Outputs: Class label: {Aligned, Misaligned, Fabricated}

Pipeline Flow

Input Processing: Extract entities from prompt P using spaCy
Model Knowledge Test (MKT): Perturb entity embeddings → Measure KL Divergence → Detect Fabrication
Alignment Test (AT): If not fabricated, perturb again (lower noise) → Measure probability change (Delta P) → Detect Misalignment
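The two-stage flow above can be sketched as a small decision function. This is a minimal illustration, not the authors' implementation: the function name `shine_decision`, the `Label` enum, and both threshold defaults are hypothetical, and the two scores are assumed to be precomputed (in the actual pipeline the Alignment Test only runs if the text is not flagged as fabricated).

```python
from enum import Enum


class Label(Enum):
    ALIGNED = "aligned"
    MISALIGNED = "misaligned"
    FABRICATED = "fabricated"


def shine_decision(kl_score: float, delta_p: float,
                   kl_threshold: float = 0.5,
                   delta_threshold: float = 0.0) -> Label:
    """Two-stage classification sketch (thresholds are placeholder values)."""
    # Stage 1, Model Knowledge Test: if the output distribution barely moves
    # when entities are perturbed, the model likely lacks knowledge.
    if kl_score < kl_threshold:
        return Label.FABRICATED
    # Stage 2, Alignment Test: if the generated tokens become MORE likely
    # under noise, the response likely contradicts internal knowledge.
    if delta_p > delta_threshold:
        return Label.MISALIGNED
    return Label.ALIGNED


if __name__ == "__main__":
    print(shine_decision(kl_score=0.1, delta_p=0.3))
    print(shine_decision(kl_score=1.2, delta_p=0.3))
    print(shine_decision(kl_score=1.2, delta_p=-0.2))
```

In practice both thresholds would be calibrated on a validation set, as noted under Limitations.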

System Modules

Entity Extractor

Identify key entities (Subject, Object) in the prompt to target for perturbation

Model or implementation: spaCy (en_core_web_sm)

Model Knowledge Test (Classification)

Determine if the model has knowledge of the entities. Low sensitivity to noise = Fabricated.

Model or implementation: Target LLM (e.g., LLaMA2)

Alignment Test (Classification)

Determine if the response aligns with internal knowledge. Increased probability under noise = Misaligned.

Model or implementation: Target LLM (e.g., LLaMA2)

Novel Architectural Elements

Two-stage pipeline separating knowledge existence check (MKT) from alignment check (AT).
Perturbation mechanism that injects Gaussian noise directly into embedding vectors of specific entities based on their attention weights.
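As a rough illustration of the perturbation mechanism, the sketch below adds Gaussian noise only to the embedding rows of entity tokens. The summary does not specify how attention weights enter the computation; scaling the noise standard deviation by each token's attention weight is one possible reading and is an assumption here, as are the function name and all parameter names.

```python
import random


def perturb_entity_embeddings(embeddings, entity_positions,
                              attention_weights, sigma, seed=None):
    """Return a copy of `embeddings` with Gaussian noise added to entity rows.

    embeddings:        list of token embedding vectors (lists of floats)
    entity_positions:  token indices that belong to key entities
    attention_weights: one weight per token (assumed to scale the noise)
    sigma:             base noise level (a hyperparameter)
    """
    rng = random.Random(seed)
    # Copy so the caller's embeddings are left untouched.
    perturbed = [list(vec) for vec in embeddings]
    for pos in entity_positions:
        # Assumption: noise std is the base level times the token's attention weight.
        scale = sigma * attention_weights[pos]
        perturbed[pos] = [x + rng.gauss(0.0, scale) for x in perturbed[pos]]
    return perturbed


if __name__ == "__main__":
    emb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    noisy = perturb_entity_embeddings(emb, entity_positions=[1],
                                      attention_weights=[0.2, 0.7, 0.1],
                                      sigma=0.5, seed=0)
    print(noisy)
```

Non-entity tokens are left unchanged, so any shift in the output distribution can be attributed to the perturbed entities.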

Modeling

Base Model: LLaMA2-13B-Chat, LLaMA3-8B-Instruct, Mistral-7B-Instruct, Qwen2.5-7B-Instruct

Comparison to Prior Work

vs. SelfCheckGPT: SHINE analyzes internal state changes (perturbation) rather than just output consistency.
vs. Predictive Uncertainty: SHINE distinguishes between uncertainty due to ignorance (fabrication) vs. confusion/errors (misalignment).
vs. SAPLMA [not cited in paper]: SAPLMA uses internal activations to detect factual errors; SHINE actively perturbs inputs to probe stability.

Limitations

Requires white-box access to the model (embeddings and logits), making it inapplicable to closed APIs like GPT-4.
Computationally costly: each perturbation is repeated 10 times, requiring multiple forward passes, plus a potential SelfCheckGPT fallback.
Relies on accurate entity extraction; if spaCy misses the key entity, the perturbation target is lost.
Hyperparameters for noise levels and thresholds need calibration on a validation set.

Reproducibility

Code: https://github.com/poloclub/SHINE

The method relies on white-box access to the model to compute the gradients/logits used for attention and probability shifts. Hyperparameters (noise level sigma) are provided in the paper.

📊 Experiments & Results

Evaluation Setup

Hallucination detection on QA and text generation tasks.

Benchmarks:

TriviaQA (Question Answering)
SQuAD (Reading Comprehension/QA)
Natural Questions (NQ) (Open-domain QA)
TruthfulQA (Factuality evaluation)

Metrics:

AUROC (Area Under Receiver Operating Characteristic Curve)
Statistical methodology: Not explicitly reported in the paper
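For readers unfamiliar with the metric, AUROC can be computed as the probability that a randomly chosen positive example (here, a hallucination) receives a higher detector score than a randomly chosen negative one, with ties counted as half. The helper below is a generic pairwise sketch, not the paper's evaluation code; the function name and label convention (1 = hallucinated) are assumptions.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; labels are 1 (positive) or 0 (negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of examples, which is acceptable for illustration; a rank-based implementation is preferable for large evaluation sets.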

Key Results

SHINE achieves state-of-the-art hallucination detection performance across multiple datasets and models compared to unsupervised baselines.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	AUROC	0.81	0.88	+0.07
SQuAD	AUROC	0.78	0.82	+0.04
TruthfulQA	AUROC	0.68	0.83	+0.15
TriviaQA	AUROC	0.73	0.88	+0.15

Experiment Figures

Distribution of KL Divergence and Delta P for Aligned, Misaligned, and Fabricated text.

Main Takeaways

Perturbing key entities reveals distinct patterns for hallucinations: Fabricated text has low KL divergence (insensitive to input), while Aligned text changes significantly.
Misaligned text (contradictions) often shows increased probability for generated tokens when noise is added (positive Delta P).
The method generalizes well across different LLM architectures (LLaMA2, LLaMA3, Mistral, Qwen) without retraining.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM token generation (logits, probabilities)
Concept of embedding space and vector perturbation
Kullback-Leibler (KL) Divergence

Key Terms

Hallucination Probing: A task classifying text into Aligned (faithful to the model's knowledge), Misaligned (the model has the knowledge but the text contradicts it), and Fabricated (the model lacks the knowledge).

Entity Perturbation Impact: The change in an LLM's output distribution when noise is added to the embedding of key entities in the prompt.

Model Knowledge Score: A metric based on KL divergence measuring how much the output distribution changes after perturbing entities; low scores indicate fabrication.

Alignment Score: A metric measuring the change in probability of the generated tokens after perturbation; positive changes suggest misalignment.

SelfCheckGPT: A baseline method that checks consistency across multiple sampled generations to detect hallucinations.

KL divergence: A measure of how one probability distribution differs from a second, reference probability distribution. It is asymmetric, so it is not a true distance metric.
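The two scores defined above can be illustrated with a short sketch. The direction of the KL divergence (original distribution relative to the perturbed one) and the averaging of Delta P over generated tokens are assumptions made for this example; the summary does not specify either, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing; eps guards against log(0) in q.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def delta_p(orig_token_probs, perturbed_token_probs):
    """Mean change in the probability of the generated tokens after perturbation.

    Positive values mean the generated tokens became more likely under noise.
    """
    diffs = [after - before
             for before, after in zip(orig_token_probs, perturbed_token_probs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    original = [0.9, 0.1]    # next-token distribution without noise
    perturbed = [0.5, 0.5]   # next-token distribution with entity noise
    print(kl_divergence(original, perturbed))
    print(delta_p([0.2, 0.4], [0.3, 0.6]))
```

Under this reading, a KL value near zero indicates the output is insensitive to entity noise (suggesting fabrication), while a positive Delta P suggests misalignment.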