Do LLMs Know about Hallucination? An Empirical Investigation of LLM's Hidden States

📝 Paper Summary

Hallucination detection
Interpretability / Mechanistic Interpretability
Internal state analysis

LLMs encode distinct patterns in their final hidden states when processing hallucinated versus correct answers, which can be exploited to detect and mitigate hallucinations via activation engineering.

Core Problem

It remains unclear whether LLMs are internally aware of their own hallucinations or whether they generate falsehoods in a way that is internally indistinguishable from truths.

Why it matters:

Hallucinations prevent safe integration of LLMs into critical downstream applications.
Existing detection methods often rely on external knowledge or complex uncertainty metrics, whereas internal state awareness could offer a more direct detection mechanism.
Understanding the mechanism of hallucination generation is essential for developing robust mitigation strategies.

Concrete Example: When an LLM receives the input 'Question: What happens to you if you eat watermelon seeds? Answer: You grow watermelons in your stomach.', its internal hidden state should ideally react differently than if the answer were 'Nothing happens.' If the hidden states are identical, the model lacks awareness of the falsehood.

Key Novelty

Hidden State Awareness & Directional Mitigation

Proposes an experimental framework comparing hidden states of the *same* question paired with either a correct or hallucinated answer to isolate truthfulness encoding.
Identifies specific 'directions' in the hidden representation space (via PCA on state transitions) that correspond to truthfulness.
Uses these directions for 'activation engineering': adding a vector to the hidden state during inference to steer the model away from hallucinations.

Architecture

Experimental framework for extracting hidden states. Shows two parallel passes: one with a correct answer, one with a hallucinated answer.

Evaluation Highlights

Statistically significant difference in hidden state cosine similarity when processing correct vs. hallucinated answers (p < 0.01 across LLaMA-2 7B, 13B, and Chat variants).
Middle transformer layers (10-20) are most effective at distinguishing hallucinations, while early layers (<10) contribute little.
Incorporating reference knowledge significantly increases the 'awareness score' (differentiation between correct/hallucinated states), particularly for LLaMA-2-Chat 7B (t-statistic 16.71).

Breakthrough Assessment

7/10

Provides strong empirical evidence of internal hallucination awareness and demonstrates a novel mitigation strategy using activation engineering. However, the mitigation results are qualitative case studies rather than large-scale quantitative benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Analyzing the hidden states of autoregressive LLMs during the processing of forced inputs (teacher-forcing style) containing either correct or hallucinated answers.

Inputs: Two sequences per question: (1) Question + Correct Answer, (2) Question + Hallucinated Answer.

Outputs: Hidden states s1 (end of question), s2 (end of hallucinated answer), and s3 (end of correct answer).

Pipeline Flow

Input Construction (Create pairs: Q+Correct, Q+Hallucinated)
Forward Pass (Run LLM on both inputs independently)
State Extraction (Extract s1, s2, s3 from final layer)
Analysis (Compute cosine similarity, PCA directions)
Mitigation (Add 'correct direction' vector to hidden states during generation)

System Modules

Hidden State Extractor (Analysis)

Extracts the final layer hidden states at specific token positions

Model or implementation: LLaMA-2 family

Direction Calculator (Analysis)

Identifies the 'truthfulness' direction in latent space

Model or implementation: PCA on difference vectors
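A minimal sketch of how such a direction could be obtained, using only the Python standard library. It assumes the difference vectors are per-question differences between the correct-answer and hallucinated-answer states; the paper's exact construction may differ. The function name `first_principal_direction`, the power-iteration approach, and all numbers are illustrative assumptions, not the authors' code.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def first_principal_direction(vectors, iters=200, seed=0):
    """Unit-norm first principal component, found by power iteration on
    the covariance matrix (applied implicitly, never formed explicitly)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    rng = random.Random(seed)
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        # Covariance times d is proportional to sum_i (x_i . d) * x_i.
        new = [0.0] * dim
        for x in centered:
            coeff = dot(x, d)
            for j in range(dim):
                new[j] += coeff * x[j]
        norm = math.sqrt(dot(new, new))
        d = [a / norm for a in new]
    return d

# Toy difference vectors (made-up numbers, not model outputs).
diffs = [
    [2.0, 0.10, 0.0],
    [1.0, -0.10, 0.1],
    [3.0, 0.00, -0.1],
    [0.5, 0.05, 0.0],
]
direction = first_principal_direction(diffs)

# PCA leaves the sign undetermined; orient the direction so that the
# average difference vector has a positive projection onto it.
mean_diff = [sum(col) / len(diffs) for col in zip(*diffs)]
if dot(mean_diff, direction) < 0:
    direction = [-a for a in direction]

# Projection scalar of a hypothetical transition vector onto the direction.
transition = [1.2, 0.3, -0.1]
projection = dot(transition, direction)
print(direction, projection)
```

With real hidden states (several thousand dimensions), a library PCA or SVD routine would be the practical choice; power iteration is used here only to keep the sketch dependency-free.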

Novel Architectural Elements

Experimental framework comparing paired hidden states (correct vs. hallucinated) conditioned on identical questions to isolate truthfulness features
Use of attention-blocking to measure 'effect size' of the question component on the final hidden state
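The attention-blocking idea can be illustrated with a toy single-head attention step. This is a sketch under strong simplifying assumptions (one head, one layer, made-up keys and values), not the paper's procedure on LLaMA-2: it computes the output for the last token with and without access to the question positions and reports the L2 norm of the difference as the effect size. The names `attend` and `effect_size` are hypothetical.

```python
import math

def attend(query, keys, values, blocked=()):
    """Single-head dot-product attention for one query position.
    Positions listed in `blocked` are masked out before the softmax."""
    scale = math.sqrt(len(query))
    scores = {
        i: sum(q * k for q, k in zip(query, key)) / scale
        for i, key in enumerate(keys)
        if i not in blocked
    }
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {i: math.exp(s - top) for i, s in scores.items()}
    total = sum(weights.values())
    dim = len(values[0])
    return [
        sum(weights[i] / total * values[i][j] for i in weights)
        for j in range(dim)
    ]

def effect_size(state_full, state_blocked):
    """L2 norm of the change caused by blocking attention."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(state_full, state_blocked)))

# Positions 0-1 stand for question tokens, positions 2-3 for answer tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
query = [1.0, 0.5]  # query vector of the last answer token

full_state = attend(query, keys, values)
blocked_state = attend(query, keys, values, blocked=(0, 1))
effect = effect_size(full_state, blocked_state)
print(effect)
```

A larger effect size means the final state depends more heavily on the question tokens.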

Modeling

Base Model: LLaMA-2 7B, LLaMA-2-Chat 7B, LLaMA-2 13B

Training Method: Inference-only analysis (no training)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Azaria and Mitchell (2023): This paper focuses on the *mechanism* (transition vectors) and *mitigation* (steering) rather than just training a binary classifier on states.
vs. Uncertainty-based detection: This approach uses internal representations rather than output probability distributions.
vs. ITI (Inference-Time Intervention) [not cited in paper]: Similar goal of shifting activations, but this paper derives the steering vector from the difference between forced correct/hallucinated answer states.

Limitations

Does not differentiate between fine-grained hallucination categories (e.g., factual fabrication vs. logical inconsistency).
Focuses primarily on the final layer's hidden state, exploring intermediate layers less comprehensively.
Experiments are limited to conventional QA tasks; domain-specific tasks and reasoning tasks are not tested.
Mitigation results are presented as case studies rather than comprehensive quantitative benchmarks.

Reproducibility

Code: https://github.com/huggingface/transformers

The paper states experimental code will be released. It uses open datasets (TruthfulQA, HaluEval) and open models (LLaMA-2). The methodology for calculating awareness scores and PCA directions is described mathematically.

📊 Experiments & Results

Evaluation Setup

Teacher-forcing inputs (Question + Answer) through LLaMA-2 models to measure internal state reactions.

Benchmarks:

TruthfulQA (QA measuring mimicry of human falsehoods)
HaluEval (hallucination evaluation; subset of 10K samples based on HotpotQA)

Metrics:

Awareness Score (cosine similarity difference)
Effect Size (L2 norm of hidden state difference with/without attention blocking)
Projection Scalar (projection of transition vector onto truthfulness direction)
Statistical methodology: Pairwise one-tailed t-tests (p-values reported)
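As a rough illustration of the statistical test, the sketch below computes a paired t statistic on per-question cosine similarities. The numbers are invented, the helper name `paired_t_statistic` is hypothetical, and the p-value lookup is left out because the standard library has no t-distribution; a statistics package would be needed for that step.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired differences a_i - b_i (H0: mean difference = 0).
    Returns the statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1

# Made-up per-question similarities: cos(s1, s2) and cos(s1, s3).
cos_hallucinated = [0.95, 0.91, 0.97, 0.93, 0.96]
cos_correct = [0.90, 0.89, 0.90, 0.88, 0.92]

t_stat, dof = paired_t_statistic(cos_hallucinated, cos_correct)
print(t_stat, dof)
```

For a one-tailed test, a t statistic above the critical value for the given degrees of freedom (3.747 for 4 degrees of freedom at the 0.01 level) would reject the null hypothesis that the mean awareness score is not positive.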

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA	Awareness Score	0	0.051	+0.051
TruthfulQA	Awareness Score	0	0.039	+0.039
HaluEval	Awareness Score	0.044	0.137	+0.093
TruthfulQA	Awareness Score	0.051	0.098	+0.047
TruthfulQA	Awareness Score	0.051	0.012	-0.039

Experiment Figures

Regression plots showing the correlation between the calculated 'awareness score' and the projection of transition vectors onto the principal component directions.

The difference in 'effect size' (impact of blocking attention to the question) between correct and hallucinated inputs across layers.

Main Takeaways

LLMs possess a measurable 'awareness' of hallucination: the final hidden state changes significantly more when processing a correct answer than a hallucinated one.
The 'awareness' correlates with model uncertainty and can be manipulated via prompting (pro-prompting increases it, anti-prompting decreases it).
Information from the question is crucial for correct answers (high effect size), but hallucinated answers rely much less on the question (low effect size).
Middle layers (10-20) are the 'sweet spot' for detecting hallucinations; early layers lack context, and later layers are less effective differentiators.
Adding a 'correct direction' vector to the hidden state (Activation Engineering) can qualitatively steer the model to correct its own hallucinations (shown in case studies).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (hidden states, layers)
Cosine similarity
Principal Component Analysis (PCA)
Activation Engineering / Steering Vectors

Key Terms

final hidden state: The hidden state of the last input token in the last transformer layer, which directly influences the next token prediction.

awareness score: A metric defined as cos(s1, s2) - cos(s1, s3), quantifying how differently the model's state changes when exposed to a hallucinated vs. correct answer.
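The definition translates directly into a few lines of Python. In this sketch s1 is the state at the end of the question, s2 the state at the end of the hallucinated answer, and s3 the state at the end of the correct answer; the three-dimensional vectors are toy values chosen for illustration, not model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def awareness_score(s1, s2, s3):
    """cos(s1, s2) - cos(s1, s3): positive when the correct-answer state s3
    has moved further from the question state s1 than the hallucinated
    state s2 has."""
    return cosine(s1, s2) - cosine(s1, s3)

s1 = [1.0, 0.0, 0.0]  # end of question
s2 = [0.9, 0.1, 0.0]  # end of hallucinated answer: stays close to s1
s3 = [0.5, 0.5, 0.5]  # end of correct answer: moves further from s1

score = awareness_score(s1, s2, s3)
print(score)
```

A score of zero means the two answers moved the state equally far (in cosine terms) from the question state.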

activation engineering: A technique involving the addition of specific vectors (offsets) to the hidden states of a model during inference to steer its behavior (e.g., towards truthfulness).
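The core operation is a vector addition, sketched below on plain Python lists. The function name `steer`, the scaling factor `alpha`, and the choice to normalize the direction are assumptions made for illustration; in a real model the same addition would be applied to a layer's hidden-state tensor during the forward pass.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction); the input list is not modified."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def projection(vec, direction):
    """Scalar projection of vec onto the unit vector along direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vec, direction)) / norm

hidden = [0.2, -0.4, 1.0]          # toy hidden state
correct_dir = [3.0, 0.0, 4.0]      # hypothetical 'correct direction'
alpha = 2.0                        # hypothetical steering strength

steered = steer(hidden, correct_dir, alpha)
shift = projection(steered, correct_dir) - projection(hidden, correct_dir)
print(steered, shift)
```

Because the direction is normalized, the projection onto it grows by exactly `alpha`, while components orthogonal to the direction are unchanged.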

PCA: Principal Component Analysis, a dimensionality reduction technique used here to find the primary 'direction' of difference between correct and hallucinated hidden state transitions.

teacher-forcing: Feeding the model the actual ground truth (or specific target text) as input history, rather than its own generated output, to inspect its internal reaction to that specific text.

pro-prompting: Using prompts that encourage confidence for correct inputs and discourage confidence for hallucinated inputs to test awareness sensitivity.

anti-prompting: Using prompts that discourage confidence for correct inputs and encourage confidence for hallucinated inputs.