LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

📝 Paper Summary

Hallucination detection, Internal representations, Error analysis

LLMs concentrate truthfulness information in the exact answer tokens of a response, so probing classifiers trained on those tokens detect errors and specific error types better than aggregation methods, though these signals do not generalize across distinct skills.

Core Problem

Existing hallucination detection methods often rely on extrinsic behavior or suboptimal internal signals (like aggregating logits across all tokens), failing to capture the rich, localized truthfulness information encoded within the model.

Why it matters:

Current detection methods miss crucial signals by inspecting the wrong tokens (e.g., last token of prompt or random generated tokens)
Understanding *where* and *how* truthfulness is encoded is essential for building reliable error detectors without external tools
Previous assumptions about 'universal truthfulness' directions may be flawed, risking the deployment of detectors that fail on new tasks

Concrete Example: In the sentence 'The capital of Connecticut is Hartford...', standard methods might average probabilities over the whole sentence or check the last token. This paper shows the truthfulness signal is concentrated specifically in the token 'Hartford'; probing elsewhere misses the signal, leading to lower detection accuracy.
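To make the 'Hartford' example concrete, the following is a minimal sketch of locating an exact answer inside a tokenized response. It assumes the answer string is already known and uses whitespace splitting as a stand-in for a real tokenizer; the function name `find_answer_span` is hypothetical and not taken from the paper's code.

```python
from typing import List, Optional, Tuple


def find_answer_span(tokens: List[str], answer: List[str]) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) indices of the first match of `answer` in `tokens`."""
    n, m = len(tokens), len(answer)
    if m == 0:
        return None
    for start in range(n - m + 1):
        if tokens[start:start + m] == answer:
            return start, start + m - 1
    return None  # the answer does not appear verbatim


# Whitespace tokens stand in for model tokens (an assumption for illustration).
response = "The capital of Connecticut is Hartford , a city on the river".split()
span = find_answer_span(response, ["Hartford"])
missing = find_answer_span(response, ["Boston"])
```

Here `span` marks the position a probe would read from, while `missing` shows the case where a heuristic match fails and some fallback (such as the instruct-model helper mentioned later) would be needed.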

Key Novelty

Exact-Answer Token Probing for Hallucination

Identifies that truthfulness information is spatially concentrated in the 'exact answer' tokens of a generation, rather than being spread evenly or located at the end
Demonstrates that linear classifiers trained on these specific token representations can predict not just correctness, but specific *types* of errors (e.g., consistent misconceptions vs. random noise)
Reveals a 'knowing-saying' gap where the model's internal representations identify the correct answer, yet the model generates an incorrect one

Architecture

Illustration of token selection strategies for error detection in long-form generation

Evaluation Highlights

Probing exact answer tokens outperforms logit-based baselines (Logits-min, P(True)) across almost all 10 datasets (e.g., +0.11 AUC on TriviaQA and +0.12 AUC on HotpotQA with Mistral-7b-Instruct)
Truthfulness probes generalize within skills (e.g., QA to QA) but degrade sharply across skills (e.g., a TriviaQA-trained probe reaches only 0.58 AUC on IMDB sentiment analysis), challenging the 'universal truthfulness' hypothesis
Probing classifiers can distinguish between error types (e.g., 'consistent error' vs. 'intermittent error') with high accuracy, which logit baselines cannot do

Breakthrough Assessment

7/10

Strong empirical evidence refining *where* to look for truthfulness (exact answer tokens) and debunking the universality of truth directions. High practical value for error detection.

⚙️ Technical Details

Problem Definition

Setting: White-box error detection in long-form LLM generation without external resources

Inputs: Input prompt q, generated response y_hat, and internal hidden states h

Outputs: Binary prediction z in {0,1} indicating if y_hat is correct or incorrect

Pipeline Flow

Generation: Model M produces response y_hat for prompt q
Extraction: Identify 'exact answer' tokens within y_hat
Probing: Extract hidden states h from specific layers at the exact answer tokens
Classification: Pass h through a trained linear classifier to predict correctness z
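The last two pipeline steps can be sketched as follows. This is an illustrative outline, not the authors' implementation: the nested-list layout `hidden_states[layer][token]`, the bias term `b`, the 0.5 decision threshold, and the convention that z = 1 means "correct" are all assumptions, and the toy numbers are made up.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probe_correctness(
    hidden_states: List[List[Sequence[float]]],  # [layer][token] -> vector h
    layer: int,
    answer_end: int,  # index of the last exact answer token
    w: Sequence[float],
    b: float,
) -> float:
    """Return the probe's estimated probability that the response is correct."""
    h = hidden_states[layer][answer_end]
    logit = sum(wi * hi for wi, hi in zip(w, h)) + b
    return sigmoid(logit)


# Toy states: 2 layers, 3 tokens, 2 dimensions (synthetic values).
hidden_states = [
    [[0.0, 0.0], [0.1, -0.2], [0.3, 0.1]],
    [[0.5, 0.5], [1.0, -1.0], [2.0, 1.0]],
]
w, b = [1.0, -1.0], 0.0
p = probe_correctness(hidden_states, layer=1, answer_end=2, w=w, b=b)
z = int(p >= 0.5)  # assumed convention: 1 = predicted correct
```

In a real setup the hidden states would come from the frozen LLM's forward pass and `w`, `b` from a probe trained on labeled generations.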

System Modules

Generator

Generate long-form response to the input question

Model or implementation: Mistral-7b / Llama-3-8b (Base & Instruct variants)

Token Extractor

Identify the indices of the exact answer tokens

Model or implementation: Heuristic or Instruct-LLM helper

Probing Classifier

Predict correctness based on hidden states

Model or implementation: Logistic Regression (Linear Probe)
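As a rough illustration of what training such a probe involves, here is a minimal logistic regression fit by full-batch gradient descent on synthetic vectors. The data, the learning rate, the epoch count, and the helper names are assumptions for the sketch; the summary does not state which solver or library the authors used.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def train_probe(
    X: List[List[float]], y: List[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit probe weights and bias by gradient descent on the mean log loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(dot(w, x) + b) - t  # gradient of log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic "hidden states": the label depends on one linear direction.
random.seed(0)
direction = [1.0, -1.0, 0.5, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
y = [1 if dot(direction, x) > 0 else 0 for x in X]

w, b = train_probe(X[:150], y[:150])
held_out = list(zip(X[150:], y[150:]))
accuracy = sum(int(sigmoid(dot(w, x) + b) >= 0.5) == t for x, t in held_out) / len(held_out)
```

Only `w` and `b` are updated here, which mirrors the frozen-LLM setting: the vectors in `X` play the role of fixed hidden states.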

Novel Architectural Elements

Targeting 'exact answer tokens' for probe training rather than last-token or aggregated sequence metrics

Modeling

Base Model: Mistral-7b, Mistral-7b-Instruct, Llama-3-8b, Llama-3-8b-Instruct

Training Method: Linear Probing (Logistic Regression) on frozen LLM states

Objective Functions:

Purpose: Minimize classification error of the probe.

Formally: Standard Logistic Regression objective (optimizing weights W to predict label z from hidden state h).
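Written out, the standard logistic regression objective takes the form below. The bias term b, the example index i, the dataset size N, and the convention that z_i = 1 marks a correct response are assumptions added for completeness; the summary does not say whether regularization is applied, so none is shown.

```latex
p_i = \sigma\!\left(W^\top h_i + b\right), \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}

\mathcal{L}(W, b) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ z_i \log p_i + (1 - z_i) \log (1 - p_i) \Big]
```

Here h_i is the hidden state at the selected layer and token for example i, and p_i is the probe's predicted probability of correctness.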

Adaptation: None (LLM is frozen; only probe is trained)

Training Data:

10 datasets including TriviaQA, HotpotQA, Natural Questions, Math, Winobias
Split into train/validation/test sets (sizes vary per dataset, typically thousands of examples)

Key Hyperparameters:

layer_selection: Selected based on validation set performance (typically middle-to-late layers)
token_selection: Last exact answer token (empirically best)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Logits-based: Probing uses vector representations rather than scalar outputs, capturing richer semantics
vs. Previous Probing (e.g., Li et al., 2024): Targets 'exact answer' tokens inside long generation instead of last token or fixed positions
vs. SAPLMA: Does not require multiple expensive generations; works on a single pass (though the authors do resample answers to analyze consistency for error typing)
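For contrast with vector-based probing, a scalar logit-style baseline can be sketched as the minimum token probability over a span. The exact definition of the paper's Logits-min baseline is not given in this summary, so the function below is one plausible reading, and the token probabilities are invented for illustration.

```python
import math
from typing import List, Optional, Tuple


def min_token_prob(
    token_logprobs: List[float], span: Optional[Tuple[int, int]] = None
) -> float:
    """Minimum token probability over the whole response or an inclusive (start, end) span."""
    if span is not None:
        start, end = span
        token_logprobs = token_logprobs[start:end + 1]
    if not token_logprobs:
        raise ValueError("no tokens to score")
    return math.exp(min(token_logprobs))


# Invented per-token probabilities; index 5 is assumed to be the exact answer token.
token_probs = [0.9, 0.2, 0.95, 0.85, 0.9, 0.7, 0.99]
token_logprobs = [math.log(p) for p in token_probs]

whole_response = min_token_prob(token_logprobs)
answer_only = min_token_prob(token_logprobs, span=(5, 5))
```

In this toy case the whole-response score is driven by a low-probability token outside the answer, which illustrates why aggregating over all tokens can blur the signal. Either way the result is a single number, whereas a probe reads the full hidden-state vector.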

Limitations

Probes do not generalize across different tasks (e.g., QA to Sentiment), requiring task-specific training data
Requires ground truth labels to train the probes initially
Exact answer extraction requires either heuristics or a secondary model call
Analysis is limited to 7B and 8B parameter models; scaling behavior not tested

Reproducibility

Code: https://github.com/technion-cs-nlp/LLMsKnow

Datasets are public benchmarks. Exact answer extraction uses heuristics or instruct models (details in Appendix).

📊 Experiments & Results

Evaluation Setup

Binary classification of generated answer correctness (Error Detection)

Benchmarks:

TriviaQA (Knowledge Retrieval QA)
HotpotQA (Multi-hop QA)
Natural Questions (Open-domain QA)
IMDB (Sentiment Analysis)

Metrics:

AUC (Area Under ROC Curve)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of error detection methods using the exact answer token location. Probing consistently outperforms logit-based and prompting baselines.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA (Mistral-7b-Instruct)	AUC	0.72	0.83	+0.11
HotpotQA (Mistral-7b-Instruct)	AUC	0.62	0.74	+0.12
Natural Questions (Mistral-7b-Instruct)	AUC	0.63	0.75	+0.12

Generalization experiments showing that probes fail when transferred between dissimilar tasks (e.g., TriviaQA to IMDB).
Setting	Metric	Baseline	This Paper	Δ
Train: TriviaQA / Test: IMDB	AUC	0.78	0.58	-0.20

Experiment Figures

Heatmap of Probing AUC scores across different layers and token positions

Generalization matrix of AUC scores when training on one dataset and testing on another

Main Takeaways

Truthfulness information is highly localized in 'exact answer' tokens; probing these tokens yields significantly higher detection accuracy than probing random or final tokens.
Internal representations encode 'skill-specific' truthfulness; detectors generalize well between similar tasks (e.g., TriviaQA to HotpotQA) but fail across distinct skills (e.g., QA to Sentiment).
LLMs exhibit a 'knowing-saying' gap: in some cases, internal probes assign high confidence to the *correct* answer even though the model actually generates an *incorrect* one.
Internal states can distinguish between error types (e.g., 'Consistent Error' vs. 'Intermittent Error'), offering more granular diagnostics than confidence scores.
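One way to surface the knowing-saying gap, sketched here as an assumption rather than as the paper's documented procedure, is to score several sampled answers with the probe and compare the top-scoring one against what the model produced by default. The answers and probe scores below are invented purely for illustration and are not results.

```python
from typing import List, Tuple


def pick_by_probe(candidates: List[Tuple[str, float]]) -> str:
    """Return the candidate answer with the highest probe score."""
    if not candidates:
        raise ValueError("no candidates to rank")
    return max(candidates, key=lambda c: c[1])[0]


# Hypothetical sampled answers paired with hypothetical probe scores.
default_answer = "New Haven"
samples = [
    ("New Haven", 0.31),
    ("Hartford", 0.88),
    ("New Haven", 0.35),
    ("Bridgeport", 0.12),
]

chosen = pick_by_probe(samples)
gap_detected = chosen != default_answer  # probe prefers a different answer than the default
```

When `gap_detected` is true, the internal signal favors an answer the model did not produce by default, which is the situation the takeaway describes.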

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (hidden states, layers)
Linear probing (training classifiers on intermediate activations)
ROC AUC metrics for binary classification

Key Terms

probing classifiers: Small linear models trained on the internal activations (hidden states) of a frozen LLM to predict properties of the input or generation

hallucinations: Any type of error generated by an LLM, including factual inaccuracies, biases, and reasoning failures

exact answer tokens: The specific tokens within a generated response that carry the core semantic content of the answer (e.g., 'Hartford' in a sentence about capitals)

logits: The raw, unnormalized scores output by the final layer of the model before the softmax function converts them to probabilities

AUC: Area Under the ROC Curve, a metric for binary classification performance where 0.5 is random guessing and 1.0 is perfect separation