
Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models

Weihang Su, Changyue Wang, Qingyao Ai, Yiran HU, Zhijing Wu, Yujia Zhou, Yiqun Liu
Department of Computer Science and Technology, Tsinghua University, School of Computer Science and Technology, Beijing Institute of Technology, School of Information, Renmin University of China
arXiv (2024)
Factuality Benchmark

📝 Paper Summary

Hallucination Detection, Internal State Analysis
MIND is an unsupervised framework that detects hallucinations in real-time by training a simple classifier on the LLM's internal hidden states, using automatically generated pseudo-labels from Wikipedia truncations.
Core Problem
Existing hallucination detection methods rely on computationally expensive post-processing or require extensive human-annotated data, making them unsuitable for real-time applications or rapid model updates.
Why it matters:
  • Post-processing methods (such as consistency checking or querying a second LLM) add significant latency and cost, often doubling inference time
  • Supervised methods require expensive manual annotations that become obsolete as LLMs evolve rapidly
  • Current benchmarks often lack the internal state data needed to analyze *why* hallucinations occur during generation
Concrete Example: When an LLM is asked to complete a truncated Wikipedia article about a specific entity, it might generate a coherent but factually wrong continuation. Post-processing methods would need to retrieve external evidence or re-query the model to catch this, whereas MIND detects it instantly from the hidden states of the generated tokens.
Key Novelty
Unsupervised Modeling of Internal States (MIND)
  • Generates pseudo-labeled training data automatically by truncating Wikipedia articles and checking if the LLM can correctly reproduce the known next entity
  • Trains a lightweight Multi-Layer Perceptron (MLP) directly on the LLM's contextualized embeddings (hidden states) to classify generation steps as hallucinated or not
  • Operates in real-time during the inference process without needing external reference documents or separate verification models
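The pseudo-labeling step described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `make_pseudo_labeled_examples`, the `generate` callable standing in for an LLM, the prefix-match rule, and the convention that label 1 means "hallucination" are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def make_pseudo_labeled_examples(
    article: str,
    entities: List[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str, int]]:
    """Truncate `article` just before each entity and label the model's continuation.

    Returns (prompt, continuation, label) triples. Label 0 means the model
    reproduced the known entity; label 1 marks a suspected hallucination.
    """
    examples = []
    for entity in entities:
        cut = article.find(entity)
        if cut <= 0:
            continue  # entity not found, or no context precedes it
        prompt = article[:cut]
        continuation = generate(prompt)
        # Assumed matching rule: the continuation must start with the entity.
        matched = continuation.strip().lower().startswith(entity.lower())
        examples.append((prompt, continuation, 0 if matched else 1))
    return examples


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: right on one prompt, wrong on the other."""
    if prompt.endswith("located in "):
        return "Paris in 1889."
    return "Gustave Courbet in 1887."


article = (
    "The Eiffel Tower is located in Paris and was designed "
    "by the firm of Gustave Eiffel."
)
examples = make_pseudo_labeled_examples(
    article, ["Paris", "Gustave Eiffel"], toy_generate
)
for prompt, continuation, label in examples:
    print(label, repr(continuation))
```

In a real setup, `generate` would wrap the LLM under study, and the hidden states recorded while it produced each continuation would be stored next to the label as training input for the classifier.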
Evaluation Highlights
  • MIND is reported to outperform existing state-of-the-art methods in hallucination detection accuracy (specific metric values are not included in this summary)
  • Shows that a simple MLP using only the last token's embedding from the final layer is sufficient to distinguish hallucinations
  • Introduces HELM, a benchmark providing internal states (embeddings, attentions) for six different LLMs alongside human-annotated outputs
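To make the "simple MLP on the last token's final-layer embedding" idea concrete, here is a small pure-Python sketch. It is illustrative only: the `TinyMLP` class, its hidden size, the learning rate, the 0.5 decision threshold, and the synthetic `fake_hidden_states` data are assumptions for this example and are not taken from the paper. Real inputs would be hidden states exported from the LLM.

```python
import math
import random


def last_token_final_layer(hidden_states):
    """hidden_states[layer][token] is a vector; pick the final layer's last token."""
    return hidden_states[-1][-1]


class TinyMLP:
    """One-hidden-layer ReLU MLP with a sigmoid output, trained by plain SGD."""

    def __init__(self, dim, hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        z = max(-30.0, min(30.0, z))  # clamp the logit to keep exp() stable
        return h, 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, x):
        """Estimated probability that the generation step is a hallucination."""
        return self._forward(x)[1]

    def train_step(self, x, y, lr=0.1):
        h, p = self._forward(x)
        dz = p - y  # gradient of binary cross-entropy w.r.t. the logit
        for j, hj in enumerate(h):
            dh = dz * self.w2[j] if hj > 0 else 0.0  # backprop through ReLU
            self.w2[j] -= lr * dz * hj
            if dh != 0.0:
                for i, xi in enumerate(x):
                    self.w1[j][i] -= lr * dh * xi
                self.b1[j] -= lr * dh
        self.b2 -= lr * dz


rng = random.Random(1)


def fake_hidden_states(label, dim=4, layers=2, tokens=3):
    """Synthetic stand-in for LLM hidden states, shaped [layer][token][dim]."""
    states = [[[rng.uniform(-0.3, 0.3) for _ in range(dim)]
               for _ in range(tokens)] for _ in range(layers)]
    # Shift one coordinate of the last token in the final layer by class.
    states[-1][-1][0] += 1.0 if label == 1 else -1.0
    return states


labels = [i % 2 for i in range(40)]  # 1 = hallucination (pseudo-label)
features = [last_token_final_layer(fake_hidden_states(y)) for y in labels]

clf = TinyMLP(dim=4)
for _ in range(100):
    for x, y in zip(features, labels):
        clf.train_step(x, y)

predictions = [1 if clf.predict_proba(x) >= 0.5 else 0 for x in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print("training accuracy:", accuracy)
```

Because the classifier reads a single vector that the LLM already computes during generation, scoring a step costs one small forward pass, which is what makes this style of detection plausible in real time.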
Breakthrough Assessment
7/10
Offers a practical, unsupervised solution to a major LLM reliability problem. The shift from post-hoc verification to real-time internal state monitoring is significant, though reliance on Wikipedia for training data is a common heuristic.