INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

📝 Paper Summary

Hallucination detection, Uncertainty estimation

INSIDE detects hallucinations by analyzing the eigenvalues of the covariance matrix of internal sentence embeddings to measure semantic divergence, coupled with feature clipping to reduce overconfidence.

Core Problem

Existing hallucination detection methods rely on logit-level uncertainty or language-level consistency, which lose dense semantic information during decoding and fail to detect self-consistent (overconfident) hallucinations.

Why it matters:

Token-level uncertainty (logits) is hard to aggregate into sentence-level metrics for sophisticated LLM responses.
Language-level consistency checks (lexical similarity) lose the rich semantic information preserved in the model's internal states.
Current methods struggle with 'overconfident hallucinations,' where models consistently generate the same wrong answer due to extreme internal feature activations.

Concrete Example: When an LLM is asked a question it cannot answer, it may confidently generate three consistent but wrong answers (hallucinations). A lexical similarity metric would rate these as 'consistent' (low hallucination risk) and miss the error. INSIDE instead analyzes the internal embeddings to find subtle semantic divergences, and truncates the extreme activations that cause this overconfidence.

Key Novelty

EigenScore metric and Test-Time Feature Clipping

Proposes EigenScore, a metric based on the eigenvalues of the covariance matrix of sentence embeddings, which essentially measures the differential entropy (semantic divergence) in the continuous embedding space.
Introduces a test-time feature clipping mechanism that truncates extreme activations in the neural network's internal layers, preventing the model from becoming artificially overconfident in its hallucinations.

Evaluation Highlights

Outperforms the strongest baseline by 0.052 AUROC (5.2 points, from 0.777 to 0.829) on the CoQA benchmark using LLaMA-2-7B-Chat.
Achieves best performance on TruthfulQA with an AUROC of 0.816, surpassing the strong SelfCheckGPT baseline (0.781).
Feature clipping alone improves hallucination detection AUROC by roughly 1-3 points across multiple datasets (e.g., +2.9 points on CoQA with LLaMA-2-7B-Chat).

Breakthrough Assessment

7/10

Offers a mathematically grounded metric (EigenScore as differential entropy) that effectively utilizes internal states, addressing a key limitation of text-based consistency methods. The feature clipping adds a practical robustness layer.

⚙️ Technical Details

Problem Definition

Setting: Knowledge hallucination detection in natural language generation (QA tasks)

Inputs: Input context x and K generated responses

Outputs: A scalar score indicating the likelihood of hallucination (uncertainty score)

Pipeline Flow

Generation: Sample K responses for input x
Feature Clipping: Truncate extreme activations in penultimate layer during generation
Embedding Extraction: Extract sentence embeddings (last token of middle/penultimate layer)
Covariance Calculation: Compute covariance matrix of the K embeddings
Scoring: Compute EigenScore via SVD of covariance matrix

System Modules

Feature Clipper

Truncate abnormal activations during inference to reduce overconfidence

Model or implementation: Piecewise linear function
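
A minimal sketch of the piecewise linear clip, assuming NumPy. The names `clip_thresholds`, `feature_clip`, and `reference_activations` are hypothetical, and estimating one global pair of thresholds from a sample of activations is an assumption made here for illustration. The value p is read as a percentile in percent, following the 0.2 listed under the hyperparameters.

```python
import numpy as np

def clip_thresholds(reference_activations, p=0.2):
    """Return (low, high) cut-offs at the bottom and top p-th percentiles.

    reference_activations: hypothetical sample of activations collected
    beforehand; p is expressed in percent.
    """
    flat = np.asarray(reference_activations, dtype=float).ravel()
    low = np.percentile(flat, p)
    high = np.percentile(flat, 100.0 - p)
    return low, high

def feature_clip(h, low, high):
    """Piecewise linear clip: low if h < low, high if h > high, else h."""
    return np.clip(h, low, high)

# Toy example: thresholds from a synthetic sample, then clip one vector.
rng = np.random.default_rng(0)
reference_activations = rng.normal(size=10_000)
low, high = clip_thresholds(reference_activations, p=0.2)
clipped = feature_clip(np.array([-50.0, 0.1, 50.0]), low, high)
```

Only the two extreme entries are changed; values inside the thresholds pass through untouched.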

Embedding Extractor (Metric Computation)

Extract dense semantic representations of responses

Model or implementation: LLM Internal Layers
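
As an illustration of the extraction step, the sketch below picks the last-token vector from one layer. It assumes the hidden states of a single, unpadded response are already available as one `(seq_len, d)` array per layer; that layout, the function name `sentence_embedding`, and taking "middle" to mean `len(hidden_states) // 2` are assumptions made here, not details confirmed by the paper.

```python
import numpy as np

def sentence_embedding(hidden_states, layer=None):
    """Pick the last-token vector from one layer.

    hidden_states: sequence of (seq_len, d) arrays, one per layer
                   (hypothetical layout for a single unpadded response).
    layer: index of the layer to read; defaults to the middle one.
    """
    if layer is None:
        layer = len(hidden_states) // 2  # assumed definition of "middle"
    return np.asarray(hidden_states[layer])[-1]

# Toy example: 5 layers, 7 tokens, 16 features per token.
rng = np.random.default_rng(0)
toy_states = [rng.normal(size=(7, 16)) for _ in range(5)]
emb = sentence_embedding(toy_states)
```

With right-padded batches the last position would be a padding token, so the index of the final real token would have to be used instead.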

EigenScore Calculator (Metric Computation)

Compute semantic divergence score

Model or implementation: SVD / LogDet
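
The scoring step can be sketched as follows, assuming NumPy. The function name `eigenscore`, the default `alpha=1e-3`, the per-response centering, and the 1/K normalization are assumptions made for this illustration; this summary does not report the regularization value.

```python
import numpy as np

def eigenscore(embeddings, alpha=1e-3):
    """Mean log eigenvalue of the regularized K x K covariance.

    embeddings: (K, d) array, one row per sampled response.
    alpha: small regularizer (assumed value) that keeps the
           log-determinant finite when responses are near-identical.
    """
    Z = np.asarray(embeddings, dtype=float)
    K, d = Z.shape
    # Assumed convention: center each row across its d features,
    # so the covariance is taken between responses (K x K).
    Zc = Z - Z.mean(axis=1, keepdims=True)
    cov = Zc @ Zc.T / (d - 1)
    # For a symmetric positive-definite matrix, singular values equal eigenvalues.
    eigvals = np.linalg.svd(cov + alpha * np.eye(K), compute_uv=False)
    return float(np.mean(np.log(eigvals)))  # (1/K) * logdet

rng = np.random.default_rng(0)
consistent = np.tile(rng.normal(size=64), (5, 1))  # five identical embeddings
diverse = rng.normal(size=(5, 64))                 # five unrelated embeddings
low_score = eigenscore(consistent)
high_score = eigenscore(diverse)
```

In this toy setup the identical embeddings receive a much lower score than the unrelated ones, matching the reading of the score as semantic divergence across the K samples.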

Novel Architectural Elements

Integration of spectral analysis (Eigenvalues of covariance matrix) directly into the hallucination detection pipeline.
Test-time intervention mechanism (Feature Clipping) applied to internal activations specifically for hallucination detection.

Modeling

Base Model: LLaMA-2-7B-Chat, LLaMA-2-13B-Chat, Vicuna-7B-v1.5, Vicuna-13B-v1.5

Training Method: Inference-time method only

Key Hyperparameters:

clip_percentile_p: 0.2
regularization_term_alpha: Not explicitly reported in the paper
number_of_generations_K: Not explicitly reported in the paper (implied standard sampling)

Compute: SVD computation on a K×K matrix (negligible compared to LLM generation)
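
One way to see why this cost is small: the K×K matrix built from K embeddings of dimension d shares its nonzero eigenvalues with the much larger d×d matrix, so only the small one needs decomposing. A minimal NumPy sketch on random data (the function names are hypothetical, and the sizes are chosen only to keep the example fast):

```python
import numpy as np

def gram_eigenvalues(Z):
    """Eigenvalues of the K x K matrix Z Z^T, in descending order."""
    return np.linalg.eigvalsh(Z @ Z.T)[::-1]

def feature_cov_eigenvalues(Z):
    """Top-K eigenvalues of the d x d matrix Z^T Z, in descending order."""
    K = Z.shape[0]
    return np.linalg.eigvalsh(Z.T @ Z)[::-1][:K]

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 256))       # K = 10 responses, d = 256 features
small = gram_eigenvalues(Z)          # decomposes a 10 x 10 matrix
large = feature_cov_eigenvalues(Z)   # decomposes a 256 x 256 matrix
```

Both routes return the same ten values up to floating-point error, while the second decomposes a matrix with several hundred times as many entries.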

Comparison to Prior Work

vs. Lexical Similarity: INSIDE uses dense embedding space instead of discrete token overlap, capturing better semantic nuance.
vs. Perplexity: INSIDE measures sentence-level divergence across multiple samples rather than single-sequence token probability.
vs. SelfCheckGPT: INSIDE adds feature clipping to handle overconfidence and uses spectral methods for efficiency.
vs. semantic entropy [not cited in paper]: Semantic entropy clusters generations by meaning; INSIDE uses continuous covariance eigenvalues to estimate entropy without explicit clustering.

Limitations

Computational cost of generating multiple responses (K samples) is high compared to single-generation methods.
Requires access to internal model states (white-box), making it unusable for API-only models like GPT-4.
Performance depends on the quality of the embedding space of the specific LLM used.

Reproducibility

Code: https://github.com/alibaba/eigenscore

The code is publicly available at the repository above. The method is training-free and relies on inference-time statistics. The clipping percentile (p = 0.2) is provided; the regularization constant alpha is not detailed.

📊 Experiments & Results

Evaluation Setup

QA tasks where model generation is compared against ground truth; hallucination detection treated as binary classification (correct vs. incorrect).

Benchmarks:

CoQA (Conversational Question Answering)
TruthfulQA (Truthfulness benchmark)
TriviaQA (Reading Comprehension / QA)

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
Statistical methodology: Not explicitly reported in the paper
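
For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen hallucinated response receives a higher score than a randomly chosen correct one. The sketch below computes it by direct pairwise comparison, assuming NumPy; the function name `auroc` and the toy scores are made up for illustration and are not data from the paper.

```python
import numpy as np

def auroc(scores, is_hallucination):
    """Fraction of (hallucinated, correct) pairs ranked correctly; ties count 0.5."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(is_hallucination, dtype=bool)
    pos, neg = s[y], s[~y]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # 1.0
mixed = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])    # 0.75
```

The pairwise form is quadratic in the number of samples; it is used here for clarity rather than speed.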

Key Results

Main comparison results demonstrating INSIDE (EigenScore + Feature Clipping) performance against baselines on LLaMA-2-7B-Chat.

Benchmark	Metric	Baseline	This Paper	Δ
CoQA	AUROC	0.777	0.829	+0.052
TruthfulQA	AUROC	0.781	0.816	+0.035
TriviaQA	AUROC	0.788	0.806	+0.018

Ablation showing the specific contribution of Feature Clipping (FC) when added to the EigenScore metric.

Benchmark	Metric	EigenScore	EigenScore + FC	Δ
CoQA	AUROC	0.800	0.829	+0.029
TruthfulQA	AUROC	0.793	0.816	+0.023

Experiment Figures

Activation distribution of the penultimate layer in LLaMA-7B.

Main Takeaways

EigenScore consistently outperforms language-level consistency metrics (like Lexical Similarity) and logit-level metrics (Perplexity), validating the use of internal dense embeddings.
Feature Clipping (FC) universally improves detection performance across benchmarks, confirming that truncating extreme activations helps mitigate overconfident hallucinations.
The method generalizes across different model families (LLaMA-2 and Vicuna) and sizes (7B and 13B).
Using the middle layer's last token embedding proves more effective for EigenScore than using the final layer, suggesting semantic information is best captured before the final output formatting.

📚 Prerequisite Knowledge

Prerequisites

Linear Algebra (Covariance matrices, Eigenvalues, SVD)
Information Theory (Differential Entropy)
Large Language Model architecture (Transformer internal states)

Key Terms

MSP: Maximum Softmax Probability—a baseline uncertainty measure using the highest probability token at each step.

EigenScore: The proposed metric, calculated as the log-determinant (sum of log eigenvalues) of the regularized covariance matrix of the sentence embeddings of the K sampled responses.

LogDet: Logarithm of the determinant of a matrix.

Differential Entropy: The entropy of a continuous random variable; the paper shows that, under a Gaussian assumption on the embeddings, EigenScore corresponds to this quantity (up to scaling and an additive constant).
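
For reference, the standard closed form for a Gaussian makes the link to eigenvalues explicit. The symbols n (dimension of the Gaussian) and λ_i (eigenvalues of Σ) are notation chosen here. The second expression, including the 1/K normalization and the α I_K regularizer, is an assumed form of the score with Σ taken as the K×K covariance, since this summary does not state it exactly.

```latex
% Differential entropy of X ~ N(mu, Sigma), Sigma in R^{n x n}, in nats
h(X) = \frac{1}{2}\log\det(\Sigma) + \frac{n}{2}\left(\log 2\pi + 1\right)
     = \frac{1}{2}\sum_{i=1}^{n}\log\lambda_i + \frac{n}{2}\left(\log 2\pi + 1\right)

% Assumed form of the score over K responses (Sigma in R^{K x K})
E = \frac{1}{K}\log\det\left(\Sigma + \alpha I_K\right)
  = \frac{1}{K}\sum_{i=1}^{K}\log\left(\lambda_i + \alpha\right)
```

Both expressions grow with the sum of log eigenvalues, which is why a larger score can be read as higher entropy in the embedding space.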

AUROC: Area Under the Receiver Operating Characteristic curve—a standard metric for binary classification performance.

internal states: The hidden layer representations (embeddings) within the LLM; in this work, sentence embeddings are taken from the last token of a middle layer, and feature clipping is applied to the penultimate layer.

feature clipping: Truncating the values of hidden neuron activations that fall into extreme percentiles (e.g., top/bottom 0.2%) to reduce model overconfidence.

Lexical Similarity: A baseline method measuring consistency by comparing word overlap (e.g., ROUGE scores) between generated responses.

SelfCheckGPT: A strong baseline method for hallucination detection that checks consistency among sampled responses.