
Prompt-Guided Internal States for Hallucination Detection of Large Language Models

Fujie Zhang, Peiqi Yu, Biao Yi, Baolei Zhang, Tong Li, Zheli Liu
Nankai University
arXiv (2024)
Factuality Benchmark

📝 Paper Summary

Hallucination detection, Internal state analysis
PRISM enhances cross-domain hallucination detection by using specific prompts to make the internal representation of truthfulness in LLMs more salient and consistent across different domains.
Core Problem
Supervised hallucination detectors trained on LLM internal states often fail to generalize to new domains because truthfulness information is entangled with domain-specific details.
Why it matters:
  • Hallucinations in LLMs can mislead users, necessitating reliable detection mechanisms before deployment
  • Existing supervised methods require resource-intensive collection of training data for every new domain to perform well
  • Current unsupervised methods often struggle with accuracy or require significant additional inference time
Concrete Example: A detector trained on 'cities' data might learn features specific to geography rather than truthfulness. When tested on 'medical' data, it fails because the domain-specific geometric structure of the internal states has changed, even if the underlying concept of truthfulness exists.
Key Novelty
Prompt-Guided Internal States (PRISM)
  • Uses a prompt (e.g., 'Is the following statement true or false?') to contextualize the input text before extracting internal states
  • This prompting forces the LLM to focus on truthfulness, making the geometric separation between true and false statements more distinct (salient) and stable across domains (consistent)
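The prompting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `build_prompted_input` and the newline-separated layout are assumptions, and only the example prompt wording comes from the summary above. The resulting string is what would be fed to the LLM before reading its internal states.

```python
# Hypothetical helper: the template layout (prompt, newline, statement)
# is an assumption; the paper's exact template may differ.
DEFAULT_PROMPT = "Is the following statement true or false?"


def build_prompted_input(statement: str, prompt: str = DEFAULT_PROMPT) -> str:
    """Prepend a truthfulness-focused prompt to the statement.

    The returned text is meant to be passed to the LLM in place of the
    raw statement, so that the extracted internal states reflect the
    prompted context.
    """
    return f"{prompt}\n{statement.strip()}"


if __name__ == "__main__":
    print(build_prompted_input("Paris is the capital of France."))
```

Keeping the prompt as a parameter makes it easy to compare several candidate prompts against the unprompted baseline.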
Architecture
Figure 1 (Concept)
PCA visualization of internal states with and without prompts. While not a system diagram, it illustrates the core mechanism.
Evaluation Highlights
  • Improves accuracy by 11.4% over the SAPLM baseline on the True-False dataset when training on one domain and testing on the others
  • Outperforms the best baseline (SAPLM) by 5.2% on the LogicStruct dataset, demonstrating robustness across different logical structures
  • Raises the cosine similarity of 'truthfulness directions' between domains (e.g., from 0.26 to 0.77 between the 'cities' and 'companies' datasets)
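To make the last metric concrete, the sketch below computes a cosine similarity between two per-domain directions. It assumes a 'truthfulness direction' is the difference between the mean internal state of true statements and that of false statements; the summary does not say how the paper defines it, so treat this definition as an assumption. The two-dimensional vectors are made-up toy values, not data from the paper.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def truth_direction(true_states, false_states):
    """Assumed definition: mean(true states) minus mean(false states)."""
    mean_true = mean_vector(true_states)
    mean_false = mean_vector(false_states)
    return [t - f for t, f in zip(mean_true, mean_false)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy internal states (invented for illustration only).
cities_true = [[1.0, 0.0], [3.0, 0.0]]
cities_false = [[0.0, 0.0], [0.0, 0.0]]
companies_true = [[2.0, 1.0], [2.0, 3.0]]
companies_false = [[0.0, 0.0], [0.0, 0.0]]

d_cities = truth_direction(cities_true, cities_false)
d_companies = truth_direction(companies_true, companies_false)
similarity = cosine_similarity(d_cities, d_companies)

if __name__ == "__main__":
    print(f"cosine similarity: {similarity:.4f}")
```

A value near 1 means a probe direction learned on one domain points the same way in the other, which is the property a cross-domain detector relies on.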
Breakthrough Assessment
7/10
A simple yet effective intervention (prompting) that addresses a major pain point of probe-based hallucination detection, cross-domain generalization, without retraining the LLM itself.