Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

📝 Paper Summary

Hallucination detection, Mechanistic interpretability, Internal state probing

The paper reveals that LLMs encode truthfulness through two distinct pathways—one relying on question-answer information flow and another deriving evidence solely from the generated answer—and proposes detection methods exploiting this distinction.

Core Problem

While internal LLM representations are known to encode truthfulness signals, the specific mechanisms by which these signals arise and operate remain largely unexplored.

Why it matters:

Understanding the origin of truthfulness cues is crucial for building reliable generative systems, as black-box detection methods often fail to explain underlying causes
Current internal probing methods treat all signals uniformly, potentially missing nuances in how models process well-known facts versus long-tail knowledge

Concrete Example: When a model answers 'Columbia' for the capital of South Carolina, the truthfulness signal in its hidden states may depend on information flowing from the question tokens to the answer (Q-Anchored). For other facts, the signal may instead come from the generated answer alone, independent of the question (A-Anchored). Treating these mechanisms identically limits detection accuracy.

Key Novelty

Two distinct truthfulness pathways: Question-Anchored (Q-Anchored) and Answer-Anchored (A-Anchored)

Q-Anchored pathway: Truthfulness signals depend heavily on the information flow from the exact question tokens to the answer
A-Anchored pathway: Truthfulness signals are self-contained within the generated answer and remain robust even when question information is blocked or removed
Mixture-of-Probes (MoP) & Pathway Reweighting (PR): New detection strategies that train specialized classifiers for each pathway or amplify pathway-specific signals

Architecture

Saliency distribution of attention from Exact Question tokens to Answer tokens, showing a bimodal distribution.

Evaluation Highlights

Achieves up to 10% AUC gain in hallucination detection across various datasets and models using the proposed pathway-aware methods
Demonstrates that Q-Anchored encoding dominates for well-known facts (within knowledge boundaries), while A-Anchored encoding is favored for long-tail cases
Saliency analysis reveals a bimodal distribution in attention dependency, statistically confirming the existence of two distinct mechanisms

Breakthrough Assessment

8/10

Provides significant mechanistic insight into LLM hallucinations by identifying two distinct encoding pathways. The findings are well-supported by ablation studies and lead to practical improvements in detection.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of generated answers as hallucinatory or factual based on internal hidden representations

Inputs: Input question q and model-generated answer y_hat

Outputs: Binary label z (0 or 1) indicating if the answer is hallucinatory

Pipeline Flow

Input Processing (Question + Answer)
Pathway Identification (Saliency/Knockout Analysis)
Truthfulness Detection (Pathway-Aware Probing)

System Modules

Base LLM

Generate answers and provide internal hidden states for analysis

Model or implementation: Various (Llama-3, Mistral, Qwen3)

Pathway Analyzer

Determine if an instance is Q-Anchored or A-Anchored via attention knockout or saliency

Model or implementation: Algorithmic analysis on attention maps
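
As a rough illustration of attention knockout, the sketch below masks the pre-softmax attention scores for chosen key positions, so their attention weights become exactly zero and the remaining weights renormalize. It is a toy single-query example in plain Python. The names `softmax` and `knockout_attention` and the score values are illustrative assumptions, not the paper's implementation.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    """Numerically stable softmax that tolerates -inf (masked) entries."""
    finite = [s for s in scores if s != NEG_INF]
    if not finite:
        raise ValueError("at least one position must remain unmasked")
    m = max(finite)
    exps = [0.0 if s == NEG_INF else math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def knockout_attention(scores, blocked_positions):
    """Attention weights for one query token after blocking some key positions.

    scores: raw (pre-softmax) attention scores from one answer token
            to each earlier token.
    blocked_positions: indices to cut off, e.g. the exact question tokens
                       (which tokens count as "exact" is decided elsewhere).
    """
    masked = [NEG_INF if i in blocked_positions else s
              for i, s in enumerate(scores)]
    return softmax(masked)

# Toy scores; positions 1 and 2 stand in for exact question tokens.
scores = [0.5, 2.0, 1.5, 0.1]
before = softmax(scores)
after = knockout_attention(scores, blocked_positions={1, 2})
print(before)
print(after)
```

In a real model the same masking would be applied inside selected attention layers and heads, and the hallucination probe would be re-run on the resulting hidden states to see how much its output changes.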

Pathway-Aware Detector

Predict hallucination probability using MoP or PR strategies

Model or implementation: Linear Probes (Logistic Regression)

Novel Architectural Elements

Mixture-of-Probes (MoP) architecture for hallucination detection, utilizing distinct classifiers for Q-Anchored and A-Anchored samples
Pathway Reweighting (PR) mechanism to modulate activation magnitudes based on pathway relevance
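
One possible reading of Mixture-of-Probes is sketched below: two logistic-regression probes, one per pathway, are combined by a third linear "router" that estimates how likely an instance is to be Q-Anchored. The soft gating formula, the `(weights, bias)` representation, and all numeric values are assumptions made for illustration. The summary does not say whether the paper routes softly or assigns each instance to a single probe.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def linear_probe(weights, bias, hidden):
    """Logistic-regression probe over a hidden-state vector."""
    return sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)

def mixture_of_probes(hidden, q_probe, a_probe, router):
    """Hallucination probability from two pathway-specific probes.

    q_probe, a_probe, router are (weights, bias) pairs.
    The router output is treated as P(instance is Q-Anchored);
    this soft mixture is an assumed design, not the paper's stated one.
    """
    gate = linear_probe(*router, hidden)
    p_q = linear_probe(*q_probe, hidden)
    p_a = linear_probe(*a_probe, hidden)
    return gate * p_q + (1.0 - gate) * p_a

# Hypothetical 2-dimensional hidden state and hand-picked probe parameters.
hidden = [1.0, -0.5]
q_probe = ([2.0, 0.0], 0.0)
a_probe = ([0.0, 2.0], 0.0)
router = ([0.0, 0.0], 0.0)   # gate = 0.5, i.e. an undecided router
p = mixture_of_probes(hidden, q_probe, a_probe, router)
print(p)
```

A hard-routing variant would replace the weighted sum with a choice of `p_q` or `p_a` depending on whether the gate exceeds 0.5.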

Modeling

Base Model: Evaluated on 12 models including Llama-3.2-1B/3B, Llama-3-8B/70B, Mistral-7B-v0.1/v0.3, and Qwen3-8B/32B

Training Method: Linear probing on frozen LLM representations
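
To make "linear probing on frozen representations" concrete, here is a minimal sketch that fits a logistic-regression probe by batch gradient descent. The two-dimensional synthetic features stand in for frozen LLM hidden states, and the learning rate, epoch count, and label convention (1 = hallucinatory) are assumptions chosen for the example rather than settings from the paper.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict_proba(w, b, x):
    """Probe output: estimated probability that the label is 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit weights and bias by gradient descent on the mean log loss.

    Only the probe parameters are updated; the features are fixed inputs,
    mirroring a frozen base model.
    """
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, z in zip(features, labels):
            err = predict_proba(w, b, x) - z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic stand-in data: feature 0 carries the signal, feature 1 is noise.
rng = random.Random(0)
features, labels = [], []
for _ in range(200):
    z = rng.randint(0, 1)
    center = 1.0 if z == 1 else -1.0
    features.append([rng.gauss(center, 0.5), rng.gauss(0.0, 1.0)])
    labels.append(z)

w, b = train_probe(features, labels)
accuracy = sum((predict_proba(w, b, x) >= 0.5) == bool(z)
               for x, z in zip(features, labels)) / len(labels)
print(w, b, accuracy)
```

In practice the features would be hidden states extracted from a chosen layer and token position, and a library implementation of logistic regression would usually replace the hand-written loop.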

Training Data:

Datasets: PopQA, TriviaQA, HotpotQA, Natural Questions
Generated answers labeled as factual/hallucinatory based on exact match/string inclusion
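
The labeling rule can be sketched as a small function. The normalization steps (lowercasing, stripping punctuation, collapsing whitespace) and the convention that 1 means hallucinatory are assumptions for this example; the summary only states that labels come from exact match or string inclusion.

```python
import string

def normalize(text):
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def label_answer(generated, gold_answers):
    """Return 0 (factual) if any gold answer matches or is contained in the
    generated answer after normalization, otherwise 1 (hallucinatory)."""
    pred = normalize(generated)
    for gold in gold_answers:
        g = normalize(gold)
        # Exact match is the special case of inclusion where both strings are equal.
        if g and (pred == g or g in pred):
            return 0
    return 1

print(label_answer("The capital is Columbia.", ["Columbia"]))  # 0
print(label_answer("Charleston", ["Columbia"]))                # 1
```

String inclusion is lenient: a long answer that mentions the gold entity alongside a wrong one would still be labeled factual, which is a known trade-off of this kind of automatic labeling.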

Compute: Inference-only on pre-trained models; Probing classifiers are lightweight linear models

Comparison to Prior Work

vs. Standard Probing (SAPLMA): Separates instances by mechanism (Q- vs A-Anchored) rather than training a single monolithic probe
vs. Extrinsic methods: Relies purely on internal states without external retrieval or multiple sampling
vs. Geometry of Truth (Marks and Tegmark) [cited in paper]: Focuses on the *mechanism* (pathways) of truthfulness rather than just the existence of a direction

Limitations

Analysis relies on identifying 'exact tokens' using semantic frame theory, which may be ambiguous for complex queries
Experiments focus on question-answering tasks; generalization to open-ended creative writing or reasoning is not tested
Requires access to internal model weights and activations, limiting applicability to closed-source API models

Reproducibility

The paper does not state whether code is available. The methodology (attention knockout, patching, probing) is described in detail, and the datasets used are standard public benchmarks.

📊 Experiments & Results

Evaluation Setup

Hallucination detection on open-ended question answering tasks

Benchmarks:

PopQA (Long-tail entity QA)
TriviaQA (Open-domain QA)
HotpotQA (Multi-hop QA)
Natural Questions (Open-domain QA)

Metrics:

AUC (Area Under the ROC Curve)
Statistical methodology: 95% confidence intervals reported for attention knockout results
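
For reference, AUC can be computed directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses this pairwise definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 0/1 ground-truth labels (1 = positive class).
    scores: detector scores, higher meaning more likely positive.
    """
    pos = [s for z, s in zip(labels, scores) if z == 1]
    neg = [s for z, s in zip(labels, scores) if z == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for small evaluations; sorting-based implementations give the same value more efficiently.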

Key Results

The source text reports 'up to a 10% AUC gain' in general terms but provides no consolidated results table with exact numeric comparisons against baselines, so no per-benchmark rows (Benchmark, Metric, Baseline, This Paper, Δ) are listed here.

Experiment Figures

Probability changes in hallucination detection after 'Attention Knockout' (blocking question tokens).

Prediction flip rate when 'patching' (injecting) hallucinatory cues into the question.

Probe behavior when the question is entirely removed (Answer-Only input).

Main Takeaways

Internal truthfulness encoding is not uniform; it splits into Q-Anchored (dependent on question flow) and A-Anchored (dependent on answer self-evidence) pathways.
Q-Anchored mechanisms are more prevalent for facts within the model's knowledge boundary, while A-Anchored mechanisms appear more often for long-tail/uncertain knowledge.
The model's internal representations carry information about which pathway is in use (they are 'aware' of it), which makes it possible to train specialized probes (Mixture-of-Probes).
Blocking information from exact question tokens significantly impacts Q-Anchored predictions but leaves A-Anchored predictions largely unchanged, validating the mechanistic distinction.
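
The last takeaway suggests a simple way to sort instances by pathway: compare the probe's output with and without question information blocked, and call an instance Q-Anchored when the output moves a lot. The sketch below does this with a fixed threshold. The threshold value, the function names, and the example probabilities are hypothetical; the summary does not give the paper's actual assignment criterion.

```python
def assign_pathway(p_full, p_knockout, threshold=0.2):
    """Assign a pathway from the shift in probe output under knockout.

    p_full: probe probability with normal attention.
    p_knockout: probe probability after blocking exact question tokens.
    threshold: hypothetical cut-off on the absolute shift.
    """
    shift = abs(p_full - p_knockout)
    return "Q-Anchored" if shift >= threshold else "A-Anchored"

def split_by_pathway(pairs, threshold=0.2):
    """Group instance indices by assigned pathway."""
    groups = {"Q-Anchored": [], "A-Anchored": []}
    for i, (p_full, p_knockout) in enumerate(pairs):
        groups[assign_pathway(p_full, p_knockout, threshold)].append(i)
    return groups

# Made-up (p_full, p_knockout) pairs for three instances.
pairs = [(0.9, 0.3), (0.9, 0.85), (0.1, 0.6)]
groups = split_by_pathway(pairs)
print(groups)
```

If the shifts are bimodal, as the saliency analysis suggests, a threshold placed between the two modes would separate the groups cleanly; where exactly to place it would need to be tuned on held-out data.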

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanisms (queries, keys, values)
Linear probing (training classifiers on hidden states)
Mechanistic interpretability concepts (saliency, patching)

Key Terms

Q-Anchored Pathway: A truthfulness encoding mechanism that relies heavily on information flow from the question's exact tokens to the answer

A-Anchored Pathway: A truthfulness encoding mechanism where signals are derived primarily from the generated answer itself, independent of the question

Exact tokens: Core frame elements in the text, such as the specific subject and property in a question or the critical entity in an answer

Attention knockout: A technique that blocks information flow between chosen token positions by forcing the corresponding attention weights to zero during inference, typically by masking the pre-softmax attention scores

Token patching: Replacing specific tokens in the input with tokens from a different sample to test causal effects on model representations

Saliency analysis: A method to measure the importance of specific input features (like attention weights) for a model's output or loss

Mixture-of-Probes (MoP): A proposed detection method using specialized classifiers (experts) for different truthfulness pathways

Pathway Reweighting (PR): A proposed method that modulates internal activation intensity to emphasize signals most relevant to truthfulness detection

AUC: Area Under the ROC Curve, a classification metric that measures how well a detector's scores separate the two classes (0.5 is chance level, 1.0 is perfect ranking)