LLM Internal States Reveal Hallucination Risk Faced With a Query

📝 Paper Summary

Hallucination detection, Uncertainty estimation, Mechanistic interpretability

LLMs possess an internal 'self-awareness' of whether they have seen a query during training and of how likely they are to hallucinate on it; both signals can be predicted by a lightweight probe on the hidden states of the query's last token.

Core Problem

LLMs often generate confident but factually incorrect responses (hallucinations) because they lack an explicit mechanism to express uncertainty or to recognize when a query falls outside the knowledge they acquired during training.

Why it matters:

LLMs are unreliable in real-world applications because they tend to be overconfident even when confabulating.
Existing hallucination detection methods often rely on external reference texts or heavy sampling, rather than leveraging the model's own internal signals.
Detecting hallucination risk *before* generation is critical for triggering mitigation strategies like refusal or Retrieval-Augmented Generation (RAG).

Concrete Example: When asked about a news event from 2024 (unseen during training), a 2023-era LLM might confidently fabricate details instead of refusing. This paper shows the model's internal states actually encode the 'unseen' nature of the query, even if the generated text doesn't reflect it.

Key Novelty

Internal State Probing for Hallucination Risk

Identifies specific neurons in the last layer of LLMs that activate differently for 'seen' vs. 'unseen' concepts and for high vs. low hallucination risk.
Trains a lightweight MLP probe on the hidden states of the *query's last token* to predict hallucination status before the model generates a single word of the response.

Architecture

Conceptual comparison between human cognitive processes (recognizing unknown queries) and the proposed LLM internal state analysis. It illustrates the pipeline: Query -> LLM -> Internal State Extraction -> Estimator -> Risk Label.

Evaluation Highlights

Achieves 84.32% average accuracy in estimating hallucination risk across 15 NLG tasks using Llama-2-7B's internal states.
Detects whether a query was seen during training (2020 vs. 2024 news) with 80.28% accuracy.
Outperforms perplexity-based baselines and self-check prompts (which ask the model 'can you answer this?') by significant margins.

Breakthrough Assessment

7/10

Strong empirical evidence that LLMs 'know what they don't know' at the representation level. The method is efficient (run-time probing) and effective, though primarily tested on Llama-2.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of hallucination risk based on query representations

Inputs: Internal hidden states I_{theta, q} derived from the LLM f_theta processing query q (specifically activations of the last token)

Outputs: Predicted hallucination risk label h (0 or 1)

Pipeline Flow

Input Query Processing (LLM)
Feature Extraction (Internal States)
Risk Estimation (Probing)
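
The feature-extraction step in this pipeline can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the per-token hidden states of a single layer are already available as a list of vectors and that the query is right-padded with an attention mask. The function name `last_token_state` is hypothetical.

```python
from typing import List, Sequence


def last_token_state(hidden: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Return the hidden vector of the last non-padding token.

    hidden: one vector per token position, for a single layer.
    attention_mask: 1 for real tokens, 0 for padding.
    Assumption: padding is on the right, so the last real token
    sits at index sum(attention_mask) - 1.
    """
    if len(hidden) != len(attention_mask):
        raise ValueError("hidden and attention_mask must have the same length")
    n_real = sum(attention_mask)
    if n_real == 0:
        raise ValueError("query has no real tokens")
    return list(hidden[n_real - 1])


# Toy example: 4 positions, the last one is padding.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
mask = [1, 1, 1, 0]
feature = last_token_state(states, mask)  # vector of the third token
```

With left-padded batches the last real token would instead be the final position, so the indexing rule depends on the tokenizer configuration.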

System Modules

LLM Backbone

Processes the input query and generates internal representations

Model or implementation: Llama-2-7B (also tested on Mistral-7B in secondary experiments)

Hallucination Estimator

Predicts binary hallucination risk from the extracted hidden states

Model or implementation: Llama-style MLP (down-projection, up-projection, gate, SiLU)
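
A Llama-style gated MLP of the kind named above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the weights are toy values, and mapping the down-projection directly to two logits is an assumption made here, since the summary does not describe the estimator's output head. The names `gated_mlp`, `matvec`, and `silu` are hypothetical helpers.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row of weights per output unit


def silu(x: float) -> float:
    """SiLU activation, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows = output units) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_mlp(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """Compute down(SiLU(gate(x)) * up(x)), the gated MLP pattern."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy weights (assumption): identity projections on a 2-d input.
identity = [[1.0, 0.0], [0.0, 1.0]]
logits = gated_mlp([1.0, -1.0], identity, identity, identity)
predicted = max(range(len(logits)), key=logits.__getitem__)  # index of larger logit
```

In practice the input would be the last-token hidden state of the frozen LLM, and the three projection matrices would be the only trained parameters.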

Modeling

Base Model: Llama-2-7B

Training Method: Training a probing classifier (Estimator) on top of frozen LLM states

Objective Functions:

Purpose: Minimize classification error for hallucination prediction.

Formally: Standard supervised classification loss (implied, not explicitly written as a formula in text).
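
Since the loss is not written out, one standard choice consistent with the binary setting is binary cross-entropy. The formula below is an assumption, not a quotation from the paper: g_phi denotes the estimator with parameters phi (a symbol introduced here), I_{theta, q_i} is the internal state for query q_i as defined in the Problem Definition, h_i is the 0/1 label, and N is the number of training queries.

```latex
\mathcal{L}(\phi) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ h_i \log p_i + (1 - h_i) \log (1 - p_i) \Big],
\qquad p_i = g_\phi\big(I_{\theta, q_i}\big)
```

Only phi is updated during training; the LLM parameters theta stay frozen.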

Training Data:

Queries from Super-Natural Instructions (700+ datasets, 15 tasks)
Seen/Unseen data: BBC News 2020 (seen) vs. BBC News 2024 (unseen)
Labels derived from a composite metric: a response is labeled 1 (Faithful) only if NLI predicts entailment AND ROUGE-L > median AND QuestEval > median; otherwise it is labeled 0 (Hallucination)
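
The labeling rule can be sketched as a short function. This is a minimal sketch, assuming the three per-response scores have already been computed and that each median is taken over the set of responses passed in; whether the paper computes medians per task or globally is not stated here. The function name `composite_labels` is hypothetical.

```python
from statistics import median
from typing import List, Sequence


def composite_labels(nli_entailed: Sequence[bool],
                     rouge_l: Sequence[float],
                     questeval: Sequence[float]) -> List[int]:
    """Label 1 (faithful) only when all three conditions hold, else 0."""
    rouge_med = median(rouge_l)
    qe_med = median(questeval)
    return [
        int(entailed and r > rouge_med and q > qe_med)
        for entailed, r, q in zip(nli_entailed, rouge_l, questeval)
    ]


# Toy scores for four responses (made-up values for illustration).
labels = composite_labels(
    nli_entailed=[True, True, False, True],
    rouge_l=[0.9, 0.2, 0.8, 0.7],
    questeval=[0.8, 0.7, 0.9, 0.5],
)
```

Because both thresholds are strict medians, a response at or below either median is labeled 0 even when NLI predicts entailment, so the rule is deliberately conservative about what counts as faithful.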

Key Hyperparameters:

feature_selection: Top 8 neurons selected via Mutual Information for visualization analysis
probe_architecture: Llama-style MLP (Gate, Up, Down projections)
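
Selecting neurons by mutual information can be sketched as below. This is a simplified illustration: activations are binarized at each neuron's median before estimating mutual information from counts, which is an assumption made for brevity and not necessarily the estimator used in the paper. The helper names are hypothetical.

```python
import math
from collections import Counter
from statistics import median
from typing import List, Sequence


def mutual_information(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def top_k_neurons(activations: Sequence[Sequence[float]],
                  labels: Sequence[int], k: int = 8) -> List[int]:
    """Rank neurons (columns) by MI with the labels; return top-k indices."""
    scores = []
    for j in range(len(activations[0])):
        column = [row[j] for row in activations]
        threshold = median(column)
        binarized = [int(v > threshold) for v in column]
        scores.append((mutual_information(binarized, labels), j))
    # Highest MI first; ties broken by neuron index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data: 4 queries x 3 neurons; only neuron 0 separates the labels.
acts = [[0.9, 0.5, 0.3], [0.8, 0.1, 0.3], [0.1, 0.4, 0.3], [0.2, 0.2, 0.3]]
toy_labels = [1, 1, 0, 0]
selected = top_k_neurons(acts, toy_labels, k=2)
```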

Compute: Not reported in the paper

Comparison to Prior Work

vs. PPL-based: Uses specific internal neurons rather than global output probabilities; PPL reflects fluency more than truthfulness
vs. Prompt-based: Accesses latent 'gut feeling' of the model which is often more accurate than its generated text explanation
vs. SAPLMA [not cited in paper]: SAPLMA also probes internal states for factuality, but this paper focuses specifically on the *pre-generation* risk estimation based on the query alone.
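
For reference, the perplexity used by PPL-based baselines is the exponential of the negative mean token log-probability. The sketch below shows only this definition; how the baseline turns a perplexity value into a risk decision (for example, a threshold) is not described in this summary, so none is shown.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# If every token has probability 0.25, perplexity is close to 4.
uniform_ppl = perplexity([math.log(0.25)] * 5)
```

A fluent but fabricated answer can still receive high token probabilities, which is why perplexity tracks fluency more closely than truthfulness.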

Limitations

Reliance on a composite automatic metric (NLI + ROUGE-L + QuestEval) for ground-truth labeling, which introduces its own label noise.
Primary experiments focused on Llama-2-7B; generalization to larger or RLHF-heavy models needs further verification.
Binary classification of hallucination (0/1) simplifies the continuous nature of factual errors.

Reproducibility

Code: https://github.com/ziweiji/Internal_States_Reveal_Hallucination

The paper specifies the source of the datasets (Super-Natural Instructions, LatestEval) and the logic for creating ground-truth labels using NLI, ROUGE-L, and QuestEval.

📊 Experiments & Results

Evaluation Setup

Binary classification of hallucination risk on held-out test sets.

Benchmarks:

Seen/Unseen Classification (Data Contamination Detection) [New]
Hallucination Risk Estimation (Quality Estimation / Risk Prediction) [New]

Metrics:

Accuracy
F1 Score
Statistical methodology: Not explicitly reported in the paper
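
The two reported metrics can be computed from predicted and gold labels as sketched below. Which class is treated as positive for F1 is not stated in this summary, so the `positive` argument is an assumption left to the caller; the function names are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int],
              positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made-up values for illustration).
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
acc = accuracy(gold, pred)
f1 = binary_f1(gold, pred)
```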

Key Results

The internal state probe significantly outperforms prompt-based and perplexity-based baselines in detecting whether a query was seen during training (first row). The probe also accurately predicts hallucination risk across diverse NLG tasks, consistently beating baselines (remaining rows). Accuracy is in %; Δ is in percentage points.

Benchmark	Metric	Baseline	This Paper	Δ
BBC News (2020 vs 2024)	Accuracy	59.26	80.28	+21.02
15 NLG Tasks (Average)	Accuracy	56.40	84.32	+27.92
QA Task	Accuracy	55.88	86.11	+30.23
Information Extraction	Accuracy	59.09	89.06	+29.97

Experiment Figures

Visualization of the top-8 most significant neurons from the last layer for three tasks (Dialogue, QA, Translation), colored by hallucination level.

Main Takeaways

LLMs have a latent 'self-awareness' regarding their training data exposure (seen vs. unseen) encoded in their internal states.
This self-awareness is better accessed via probing hidden states than by asking the model directly (prompting) or checking output probability (perplexity).
Specific neurons in the last layer are highly correlated with uncertainty and hallucination risk, suggesting the mechanism is localized.
The method works across 15 diverse NLG tasks, not just QA, suggesting robust generalization of the uncertainty representation.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (hidden states, layers)
Concept of Probing Classifiers in NLP
Hallucination metrics (NLI, ROUGE, QuestEval)

Key Terms

Internal States: The vector representations (activations) at specific layers of a neural network as it processes input.

Probing Classifier: A small, simple model (like a linear classifier or MLP) trained on the frozen representations of a large pre-trained model to test if those representations encode specific properties.

NLI: Natural Language Inference—determining if a hypothesis is entailed by, contradicts, or is neutral to a premise; used here to check factual consistency.

QuestEval: A reference-dependent metric for evaluating faithfulness in generation tasks using question answering.

PPL: Perplexity—a measurement of how well a probability model predicts a sample; often used as a proxy for model uncertainty.

SiLU: Sigmoid Linear Unit—an activation function used in neural networks, specifically in the Llama architecture.

Llama-2-7B: A specific open-source Large Language Model released by Meta with 7 billion parameters.

MLP: Multilayer Perceptron—a class of feedforward artificial neural network.

Super-Natural Instructions: A large benchmark dataset containing a diverse set of NLP tasks and instructions.