Prompt-Guided Internal States for Hallucination Detection of Large Language Models

📝 Paper Summary

Hallucination detection, Internal state analysis

PRISM enhances cross-domain hallucination detection by using specific prompts to make the internal representation of truthfulness in LLMs more salient and consistent across different domains.

Core Problem

Supervised hallucination detectors trained on LLM internal states often fail to generalize to new domains because truthfulness information is entangled with domain-specific details.

Why it matters:

Hallucinations in LLMs can mislead users, necessitating reliable detection mechanisms before deployment
Existing supervised methods require resource-intensive collection of training data for every new domain to perform well
Current unsupervised methods often struggle with accuracy or require significant additional inference time

Concrete Example: A detector trained on 'cities' data might learn features specific to geography rather than truthfulness. When tested on 'medical' data, it fails because the domain-specific geometric structure of the internal states has changed, even if the underlying concept of truthfulness exists.

Key Novelty

Prompt-Guided Internal States (PRISM)

Uses a prompt (e.g., 'Is the following statement true or false?') to contextualize the input text before extracting internal states
This prompting forces the LLM to focus on truthfulness, making the geometric separation between true and false statements more distinct (salient) and stable across domains (consistent)

Architecture

Figure (not reproduced here): PCA visualization of internal states with and without prompts. It is not a system diagram, but it illustrates the core mechanism.

Evaluation Highlights

Improves accuracy by 11.4 percentage points over the SAPLM baseline (73.1 to 84.5) on the True-False dataset when training on one domain and testing on the others
Outperforms the best baseline (SAPLM) by 5.2 percentage points (73.7 to 78.9) on the LogicStruct dataset, demonstrating robustness across different logical structures
Significantly increases the cosine similarity of 'truthfulness directions' between different domains (e.g., from 0.26 to 0.77 between 'cities' and 'companies' datasets)

Breakthrough Assessment

7/10

Simple yet effective intervention (prompting) that solves a major pain point (cross-domain generalization) in probe-based hallucination detection without requiring retraining of the LLM itself.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of statements as True or False based on LLM internal states

Inputs: A statement S and a Large Language Model M

Outputs: A binary label indicating whether S is a hallucination (False) or factual (True)

Pipeline Flow

Prompt Selection (Offline)
Prompt-Guided Inference
Feature Extraction
Classification

System Modules

Prompt Generator (Prompt Selection)

Generates candidate prompts by sending a meta-prompt (Prompt 2) to an LLM

Model or implementation: GPT-4o (used for generating candidates)

Prompt Selector (Prompt Selection)

Selects the best prompt by maximizing the variance ratio of the truthfulness direction on a labeled dataset

Model or implementation: Llama-2-7B-Chat
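The selection criterion can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the truthfulness direction is the difference between the mean true and mean false embeddings, and that the variance ratio is the variance of the projections onto that direction divided by the total variance. The function names and the synthetic embeddings are hypothetical.

```python
import numpy as np

def truthfulness_direction(X, y):
    """Unit vector from the mean false embedding to the mean true embedding (assumed definition)."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def variance_ratio(X, y):
    """Share of total embedding variance that lies along the truthfulness direction (assumed definition)."""
    d = truthfulness_direction(X, y)
    Xc = X - X.mean(axis=0)
    along = np.var(Xc @ d)              # variance of the 1-D projections
    total = np.var(Xc, axis=0).sum()    # trace of the covariance matrix
    return along / total

def select_prompt(embeddings_by_prompt, y):
    """Return the candidate prompt whose embeddings give the highest variance ratio."""
    return max(embeddings_by_prompt, key=lambda p: variance_ratio(embeddings_by_prompt[p], y))

# Synthetic stand-ins for embeddings extracted under two candidate prompts.
rng = np.random.default_rng(0)
y = np.arange(200) % 2                  # balanced true/false labels

def fake_embeddings(separation):
    X = rng.normal(size=(200, 8))
    X[:, 0] += separation * y           # true statements shifted along one axis
    return X

candidates = {
    "no prompt": fake_embeddings(0.5),
    "truthfulness prompt": fake_embeddings(5.0),
}
best = select_prompt(candidates, y)
```

In practice the dictionary values would be last-token hidden states from the LLM for the same labeled statements under each candidate prompt.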

Feature Extractor (Inference & Detection)

Runs the LLM on 'Prompt + Statement' and extracts the hidden state of the last token

Model or implementation: Llama-2-7B-Chat
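A rough sketch of this step, using NumPy arrays in place of real model outputs. The exact prompt layout, the layer whose states are used, and the padding side are not specified in this summary, so the sketch assumes a simple "prompt, space, statement" layout and right-padded batches; `build_input` and `last_token_states` are hypothetical helper names.

```python
import numpy as np

# Example wording quoted in the summary; the selected prompt may differ.
PROMPT = "Is the following statement true or false?"

def build_input(statement, prompt=PROMPT):
    """Prepend the truthfulness prompt to the statement (layout is an assumption)."""
    return f"{prompt} {statement}"

def last_token_states(hidden, attention_mask):
    """Select the hidden state of the last non-padded token of each sequence.

    hidden: (batch, seq_len, dim) activations from one layer.
    attention_mask: (batch, seq_len) of 0/1, assuming right padding.
    """
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]

# Toy activations: batch of 2 sequences, 4 positions, 3 dimensions.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 1]])
features = last_token_states(hidden, mask)
text = build_input("Paris is the capital of France.")
```

With a left-padding tokenizer the last token is simply the final position, so the index computation would not be needed.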

Hallucination Detector (Inference & Detection)

Classifies the extracted embedding vector as True or False

Model or implementation: Multi-Layer Perceptron (MLP)

Novel Architectural Elements

Prompt-guided feature extraction pipeline: Modifies the input to the probe by wrapping the text in a truthfulness-oriented prompt, selected to maximize the variance ratio of the truthfulness direction in activation space

Modeling

Base Model: Llama-2-7B-Chat

Training Method: Supervised training of a lightweight classifier probe (MLP) on frozen LLM embeddings

Objective Functions:

Purpose: Minimize classification error of the probe.

Formally: Standard Cross-Entropy Loss for binary classification.
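For reference, the standard binary cross-entropy can be written as below. The symbols are assumptions introduced here for illustration: h_i is the extracted embedding of statement i, y_i is its label (1 for true, 0 for false), p_θ(h_i) is the probe's predicted probability that the statement is true, and N is the number of training statements.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ y_i \log p_\theta(h_i) + (1 - y_i) \log\big(1 - p_\theta(h_i)\big) \Big]
```

If the probe instead emits two logits with a softmax, the two-class cross-entropy reduces to the same expression.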

Adaptation: None (LLM is frozen; only the probe is trained)

Trainable Parameters: Weights of the MLP probe (classifier)

Training Data:

True-False dataset (6 sub-datasets: animals, cities, companies, elements, facts, inventions)
LogicStruct dataset (24 sub-datasets: 6 topics x 4 logical structures)

Key Hyperparameters:

model: MLP (3 layers)
hidden_size: 256
activation: ReLU
optimizer: Adam
learning_rate: 5e-4
batch_size: 16
epochs: 50
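The probe's forward pass might look like the sketch below. It assumes that "3 layers" means three linear layers (input to 256, 256 to 256, 256 to 1) with a sigmoid output, which this summary does not confirm. The input width of 4096 is the hidden size of Llama-2-7B. Training with Adam at the listed learning rate, batch size, and epoch count is omitted to keep the example short.

```python
import numpy as np

def init_mlp(in_dim, hidden=256, seed=0):
    """He-initialised weights for an in_dim -> hidden -> hidden -> 1 MLP (assumed layout)."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim, hidden, hidden, 1]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def predict_proba(params, X):
    """Forward pass: ReLU on the hidden layers, sigmoid on the single output logit."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    logits = (h @ W + b).ravel()
    return 1.0 / (1.0 + np.exp(-logits))

# One batch of 16 random vectors standing in for last-token embeddings.
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 4096))
params = init_mlp(in_dim=4096)
probs = predict_proba(params, X)        # probability that each statement is true
labels = (probs > 0.5).astype(int)      # 1 = true, 0 = false (hallucination)
```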

Compute: Single NVIDIA GeForce RTX 3090 GPU

Comparison to Prior Work

vs. SAPLM/LIT: PRISM modifies the input with prompts to align internal states before extraction, whereas SAPLM/LIT extract features from raw input
vs. LN-PP: PRISM is a supervised method using internal states, while LN-PP relies on output probabilities
vs. MM: PRISM uses a non-linear MLP classifier and prompt guidance, whereas MM uses a simple linear projection

Limitations

Relies on the assumption that the LLM has internal knowledge of the truth; cannot detect hallucinations where the model was never exposed to the fact
Requires a labeled dataset (in-domain) to train the probe and select the prompt
Experiments limited to Llama-2-7B-Chat; generalization to other model families not extensively tested in the main results

Reproducibility

Code: https://github.com/fujie-math/PRISM

Code and data are publicly available in the repository above. The paper explicitly lists the prompt template used, the specific LLM (Llama-2-7B-Chat), and the hyperparameters of the probe classifier.

📊 Experiments & Results

Evaluation Setup

Cross-domain binary classification of statements. Train on one domain, test on others.
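The protocol can be illustrated with a small loop. To stay short, this sketch swaps in a difference-of-means classifier (in the spirit of the MM baseline) instead of the MLP probe, and it runs on synthetic embeddings in which all domains share one truthfulness axis but differ by a domain-specific offset. All names and data here are hypothetical.

```python
import numpy as np

def fit_mass_mean(X, y):
    """Difference-of-means probe: a direction plus a midpoint threshold."""
    mu_true, mu_false = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    direction = mu_true - mu_false
    threshold = direction @ (mu_true + mu_false) / 2
    return direction, threshold

def accuracy(probe, X, y):
    direction, threshold = probe
    pred = (X @ direction > threshold).astype(int)
    return float((pred == y).mean())

def cross_domain_accuracy(domains, train_name):
    """Train on one domain, report accuracy on every other domain."""
    probe = fit_mass_mean(*domains[train_name])
    return {
        name: accuracy(probe, X, y)
        for name, (X, y) in domains.items()
        if name != train_name
    }

rng = np.random.default_rng(0)

def make_domain(offset, n=200, dim=8):
    y = np.arange(n) % 2
    X = 0.5 * rng.normal(size=(n, dim))
    X[:, 0] += 3.0 * y        # shared truthfulness axis
    X[:, 1] += offset         # domain-specific shift
    return X, y

domains = {
    "cities": make_domain(0.0),
    "companies": make_domain(4.0),
    "animals": make_domain(-4.0),
}
results = cross_domain_accuracy(domains, train_name="cities")
```

The synthetic probe transfers well only because the truthfulness axis is identical across the made-up domains, which is the property the prompt is meant to encourage in real embeddings.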

Benchmarks:

True-False Dataset: fact verification on simple statements
LogicStruct Dataset: fact verification with logical structures (negation, conjunction, disjunction)

Metrics:

Accuracy
Variance Ratio (for structural salience analysis)
Cosine Similarity (for structural consistency analysis)

Statistical methodology: Averages reported over 5 runs with different random seeds.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ	Note
True-False Dataset	Accuracy (%)	73.1	84.5	+11.4	Main comparison on the True-False dataset showing cross-domain accuracy improvements
LogicStruct Dataset	Accuracy (%)	73.7	78.9	+5.2	Main comparison on the LogicStruct dataset showing robustness to logical variations
True-False Dataset (cities vs companies)	Cosine Similarity	0.26	0.77	+0.51	Analysis of structural consistency improvements

Main Takeaways

Prompting significantly increases the variance ratio of the truthfulness direction, making true/false statements more separable in the embedding space
Prompting aligns the truthfulness directions across different domains (high cosine similarity), enabling detectors trained on one domain to generalize to others
PRISM consistently outperforms unsupervised (probability-based) and supervised (probe-based) baselines in cross-domain settings
The method is effective across different logical structures (negation, conjunction, disjunction), not just simple affirmative sentences
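The consistency measurement behind the second takeaway can be sketched as follows, assuming each domain's truthfulness direction is estimated as the difference between its mean true and mean false embeddings. The data is synthetic and only mimics the reported pattern: it does not reproduce the paper's numbers.

```python
import numpy as np

def mean_difference_direction(X, y):
    """Per-domain truthfulness direction, estimated as mean(true) - mean(false)."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)

def synthetic_domain(shift, n=200):
    """True statements are displaced from false ones by `shift`, plus noise."""
    y = np.arange(n) % 2
    X = 0.3 * rng.normal(size=(n, shift.size)) + np.outer(y, shift)
    return X, y

e = np.eye(8)
# Mostly domain-specific separation: directions barely agree.
raw_a = synthetic_domain(e[0] + 2 * e[1])
raw_b = synthetic_domain(e[0] + 2 * e[2])
# Mostly shared separation: directions nearly coincide.
prompted_a = synthetic_domain(3 * e[0] + 0.5 * e[1])
prompted_b = synthetic_domain(3 * e[0] + 0.5 * e[2])

sim_raw = cosine_similarity(
    mean_difference_direction(*raw_a), mean_difference_direction(*raw_b))
sim_prompted = cosine_similarity(
    mean_difference_direction(*prompted_a), mean_difference_direction(*prompted_b))
```

A probe trained on one domain relies on that domain's direction, so a higher cosine similarity between domains is what lets it transfer.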

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM internal states (hidden layers/activations)
Basic knowledge of PCA (Principal Component Analysis)
Familiarity with supervised classification probes (e.g., Logistic Regression)

Key Terms

Internal States: The activation values of neurons in the hidden layers of an LLM during inference

Truthfulness Direction: A vector direction in the activation space that maximally separates true statements from false statements

SAPLM: Statement Accuracy Prediction based on Language Model activations, a baseline method that trains a classifier on internal activations

PCA: Principal Component Analysis—a dimensionality reduction technique used here to visualize the separation of true/false embeddings

Variance Ratio: A metric defined in this paper to measure how much of the total variance in embeddings is aligned with the truthfulness direction; higher means better separation

LIT: Language Model Internal States for Truthfulness—a baseline method similar to SAPLM

MM: Mass-Mean probe—a baseline method using the difference of means to classify statements

Llama-2-7B-Chat: The specific open-source Large Language Model used as the backbone for experiments