
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye
Alibaba Group
arXiv (2024)
Factuality QA

📝 Paper Summary

Hallucination detection, Uncertainty estimation
INSIDE detects hallucinations by analyzing the eigenvalues of the covariance matrix of the internal sentence embeddings of multiple sampled responses, which measures their semantic divergence, and couples this with feature clipping to reduce overconfidence.
Core Problem
Existing hallucination detection methods rely on logit-level uncertainty or language-level consistency, which lose dense semantic information during decoding and fail to detect self-consistent (overconfident) hallucinations.
Why it matters:
  • Token-level uncertainty (logits) is hard to aggregate into sentence-level metrics for sophisticated LLM responses.
  • Language-level consistency checks (lexical similarity) lose the rich semantic information preserved in the model's internal states.
  • Current methods struggle with 'overconfident hallucinations,' where models consistently generate the same wrong answer due to extreme internal feature activations.
Concrete Example: Asked a question it cannot answer, an LLM might confidently generate three consistent but wrong answers (hallucinations). A lexical similarity metric would rate these as 'consistent' (low hallucination risk) and miss the error. INSIDE instead analyzes the internal embeddings to find subtle semantic divergences, and truncates the extreme activations that cause this overconfidence.
Key Novelty
EigenScore metric and Test-Time Feature Clipping
  • Proposes EigenScore, a metric computed from the eigenvalues of the covariance matrix of the sentence embeddings of several responses to the same question; it corresponds to the differential entropy of those embeddings, i.e., their semantic divergence in the continuous embedding space.
  • Introduces a test-time feature clipping mechanism that truncates extreme activations in the neural network's internal layers, preventing the model from becoming artificially overconfident in its hallucinations.
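The two components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `eigenscore` and `clip_features`, the ridge term `alpha`, the per-embedding centering, and the use of global percentile thresholds taken from a `memory_bank` of previously seen activations are all choices made here for illustration.

```python
import numpy as np


def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Mean log-eigenvalue of the regularized covariance of K response embeddings.

    embeddings: array of shape (K, d), one sentence embedding per sampled response.
    alpha: small ridge term (assumed value) that keeps every eigenvalue positive.
    """
    k = embeddings.shape[0]
    # Center each embedding over its feature dimension (assumed convention).
    centered = embeddings - embeddings.mean(axis=1, keepdims=True)
    # K x K covariance (Gram) matrix between the sampled responses.
    cov = centered @ centered.T
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(k))
    # In exact arithmetic every eigenvalue is >= alpha; guard against round-off.
    eigvals = np.clip(eigvals, alpha, None)
    return float(np.mean(np.log(eigvals)))


def clip_features(hidden: np.ndarray, memory_bank: np.ndarray, p: float = 0.2) -> np.ndarray:
    """Clamp activations to thresholds estimated from a bank of past activations.

    p is a percentile in [0, 100] (assumed value): activations below the p-th or
    above the (100 - p)-th percentile of the bank are truncated to those values.
    """
    low = np.percentile(memory_bank, p)
    high = np.percentile(memory_bank, 100.0 - p)
    return np.clip(hidden, low, high)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data only: five identical embeddings versus five unrelated ones.
    consistent = np.tile(rng.normal(size=64), (5, 1))
    divergent = rng.normal(size=(5, 64))
    print("consistent responses:", eigenscore(consistent))
    print("divergent responses: ", eigenscore(divergent))
```

Under these assumptions, a higher score means the sampled responses are spread more widely in embedding space, which is read as a higher hallucination risk. Clipping would be applied to the hidden activations before the responses are generated and scored, so that a few extreme activations cannot force every sample to the same answer.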
Evaluation Highlights
  • Outperforms state-of-the-art baselines by +5.2% AUROC on the CoQA benchmark using LLaMA-2-7B-Chat.
  • Achieves best performance on TruthfulQA with an AUROC of 0.816, surpassing the strong SelfCheckGPT baseline (0.781).
  • Feature clipping alone improves hallucination detection AUROC by roughly 1-3% across multiple datasets (e.g., +2.9% on CoQA with LLaMA-2-7B-Chat).
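For readers unfamiliar with the metric, the sketch below shows how AUROC can be computed for a hallucination detector: it is the probability that a randomly chosen hallucinated response receives a higher detector score than a randomly chosen correct one. The function name `auroc` and the pairwise formulation are illustrative choices made here; the summary does not describe the authors' evaluation code.

```python
import numpy as np


def auroc(scores, labels) -> float:
    """Pairwise AUROC: labels are 1 for hallucinated and 0 for correct responses."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # detector scores of hallucinated responses
    neg = scores[~labels]   # detector scores of correct responses
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one sample of each class")
    # Compare every hallucinated score against every correct score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half
    return float(wins / (pos.size * neg.size))
```

A score of 0.5 corresponds to a detector that ranks responses no better than chance, and 1.0 to one that ranks every hallucinated response above every correct one.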
Breakthrough Assessment
7/10
Offers a mathematically grounded metric (EigenScore as differential entropy) that effectively utilizes internal states, addressing a key limitation of text-based consistency methods. The feature clipping adds a practical robustness layer.