Cross-Layer Attention Probing for Fine-Grained Hallucination Detection

📝 Paper Summary

Hallucination detection, Activation probing, Hallucination mitigation

CLAP detects hallucinations by processing the entire residual stream of an LLM as a sequence using an attention mechanism, enabling fine-grained detection and reliable mitigation.

Core Problem

Existing activation probing methods examine individual layers in isolation, missing information distributed across the residual stream, and fail to distinguish fine-grained hallucinations among different sampled responses.

Why it matters:

LLMs frequently generate hallucinations, posing risks of misinformation in critical applications like healthcare and search
Direct mitigation methods (like DoLa) often degrade valid responses by modifying activations indiscriminately
Current probes struggle to generalize when prompts fall outside the training domain (out-of-distribution)

Concrete Example: When an LLM is asked a question, it might generate a correct answer in a greedy decode but hallucinate in a sampled response. A single-layer probe might miss the hallucination in the sampled response if the error signal is distributed across multiple layers or if the probe overfits to the greedy decoding pattern.

Key Novelty

Cross-Layer Attention Probing (CLAP)

Treats the set of activations from all LLM layers as a sequence of input tokens rather than isolated vectors
Uses a small transformer encoder with an attention mechanism to learn which layers contain the most relevant signals for detecting hallucinations
Incorporates responses sampled at high temperatures during training to learn fine-grained boundaries between hallucinated and non-hallucinated outputs

Architecture

The CLAP pipeline: extracting layer-wise activations, down-projecting them, adding a CLS token, and processing them via a Transformer Encoder to predict a binary label.

Evaluation Highlights

Outperforms single-layer probes on out-of-distribution tasks, achieving significant AUC gains (e.g., +6.5% vs. Last Layer on Llama-3-8B-Instruct)
Reduces abstention rate by 24.5 percentage points on average (from 51.5% to 27.0%) compared to baseline mitigation strategies while maintaining high non-hallucination rates
Combining CLAP with DoLa mitigation significantly reduces the rate at which correct responses are wrongly modified (NH->H errors) compared to using DoLa alone

Breakthrough Assessment

7/10

Novel architectural approach to probing that effectively uses the whole residual stream. Strong empirical results on out-of-distribution generalization and mitigation integration.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of LLM generated responses as hallucination or non-hallucination based on internal activations.

Inputs: Activations across all layers {x_{i,l}} for the last token of a generated response r_i

Outputs: Binary label y_i (hallucination/non-hallucination)

Pipeline Flow

LLM Inference (extract layer-wise activations)
Projection (reduce dimensions)
Sequence Construction (layer activations → tokens)
Cross-Layer Attention (Transformer Encoder)
Classification (Linear Head)
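
The five steps above can be sketched as a single forward pass. This is a minimal illustration in plain Python, not the paper's implementation: it uses one attention head, omits layer normalization, the feed-forward sublayer, and positional embeddings, and uses toy sizes (16-dimensional activations, d_model of 4, 6 layers). All names (clap_forward, params, W_proj, and so on) are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (rows x cols) by vector x (length cols)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix, standing in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def clap_forward(layer_acts, params):
    """layer_acts: one activation vector per LLM layer (last token of the response)."""
    # Projection: map every layer activation down to d_model dimensions.
    tokens = [matvec(params["W_proj"], x) for x in layer_acts]
    # Sequence construction: prepend a CLS token, giving L + 1 tokens.
    seq = [params["cls"]] + tokens
    d_model = len(params["cls"])
    # Cross-layer attention: the CLS query attends over all tokens.
    q = matvec(params["W_q"], seq[0])
    keys = [matvec(params["W_k"], t) for t in seq]
    values = [matvec(params["W_v"], t) for t in seq]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_model) for k in keys]
    attn = softmax(scores)
    pooled = [sum(a * v[j] for a, v in zip(attn, values)) for j in range(d_model)]
    cls_out = [c + p for c, p in zip(seq[0], pooled)]  # residual connection
    # Classification: linear head followed by a sigmoid.
    logit = sum(w * h for w, h in zip(params["w_out"], cls_out)) + params["b_out"]
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, attn


rng = random.Random(0)
D_LLM, D_MODEL, N_LAYERS = 16, 4, 6  # toy sizes, far smaller than a real LLM
params = {
    "W_proj": rand_matrix(D_MODEL, D_LLM, rng),
    "cls": [rng.gauss(0.0, 0.1) for _ in range(D_MODEL)],
    "W_q": rand_matrix(D_MODEL, D_MODEL, rng),
    "W_k": rand_matrix(D_MODEL, D_MODEL, rng),
    "W_v": rand_matrix(D_MODEL, D_MODEL, rng),
    "w_out": [rng.gauss(0.0, 0.1) for _ in range(D_MODEL)],
    "b_out": 0.0,
}
layer_acts = rand_matrix(N_LAYERS, D_LLM, rng, scale=1.0)  # fake activations
prob, attn = clap_forward(layer_acts, params)
print(f"P(hallucination) = {prob:.3f}")
print("attention over [CLS] + layers:", [round(a, 3) for a in attn])
```

The attention weights returned alongside the probability show how much each layer contributes to the CLS readout, which is the quantity a single-layer probe cannot express.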

System Modules

Down-Projection Layer

Reduces the high dimensionality of LLM activations to a manageable size

Model or implementation: Learnable linear projection

Transformer Encoder

Learns relationships and attends to relevant layers across the residual stream

Model or implementation: Transformer Encoder block (1-2 layers)

Classification Head

Predicts the probability of hallucination

Model or implementation: Linear classifier

Novel Architectural Elements

Treating the vertical stack of layer activations as a horizontal sequence of tokens for a secondary Transformer
Applying an attention mechanism specifically across the layer dimension (residual stream) for probing

Modeling

Base Model: Evaluated on Llama-7B, Alpaca-7B, Vicuna-7B, Gemma-2B, Llama3.1-Instruct-8B

Training Method: Supervised training of the CLAP probe (projection + encoder + classifier)

Objective Functions:

Purpose: Minimize classification error for hallucination detection.

Formally: Binary Cross-Entropy loss on the label y_i.
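
Written out, the standard binary cross-entropy takes the form below. The symbols other than y_i are assumptions introduced here for illustration: p_i is the probe's predicted hallucination probability for response r_i, N is the number of training responses, and θ collects the trainable probe parameters; y_i = 1 is assumed to mark a hallucination.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
```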

Trainable Parameters: Projection layer, Transformer Encoder (1-2 layers), Classification head (approx 1.1M params)

Training Data:

Natural Questions (NQ), Trivia QA (TQA), Strategy QA (STR), WikiData subsets
Augmented with K sampled responses per prompt (temp=1, top_p=0.95)
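
The sampling settings above (temperature 1, top_p 0.95) correspond to nucleus sampling. The sketch below shows a single sampling step over a toy logit vector, assuming the usual definition of top-p filtering; the function name sample_top_p and the example logits are illustrative and not taken from the paper.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=random):
    """Draw one token id using temperature scaling and nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Visit token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # smallest set whose total probability reaches top_p
    # random.choices renormalizes the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [4.0, 3.0, 0.0, -5.0]  # hypothetical next-token logits
K = 5  # number of sampled responses per prompt (value chosen for illustration)
samples = [sample_top_p(toy_logits, temperature=1.0, top_p=0.95, rng=rng) for _ in range(K)]
print(samples)
```

With these toy logits the two most likely tokens already cover more than 95% of the probability mass, so the low-probability tail is never sampled.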

Key Hyperparameters:

d_model: 128 (projection dimension)
n_enc: 1 or 2 (encoder layers)
sampling_temperature: 1
top_p: 0.95

Compute: Not reported in the paper

Comparison to Prior Work

vs. Linear/Non-linear Probes (LP/NLP): CLAP uses attention across all layers rather than a single layer
vs. Semantic Entropy: CLAP is a trained probe on activations rather than a sampling-based consistency check
vs. DoLa: CLAP is a detection method used to trigger mitigation, whereas DoLa is a mitigation strategy itself
vs. P-True [not cited in paper]: P-True probes for truthfulness in few-shot settings; CLAP focuses on hallucination detection via cross-layer attention

Limitations

Computational cost scales with the number of layers in the LLM, requiring down-projection
Requires ground truth answers for training supervision
Performance gains vary across different LLM architectures (e.g., less gain on Gemma-2B vs AH baseline)

Reproducibility

No replication artifacts mentioned in the paper. Code, data splits, and trained weights are not provided. Prompt formats are in Appendix A.1.

📊 Experiments & Results

Evaluation Setup

Closed-book QA and reasoning tasks detecting hallucinations in generated responses

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
Trivia QA (TQA) (Open-domain QA)
Strategy QA (STR) (Chain-of-thought reasoning)
WikiData subsets (relation extraction, e.g., city-country; used for OOD testing)

Metrics:

AUC (Area Under ROC Curve)
Accuracy
Macro-F1
Percentage of Non-Hallucinations (%NH)
Statistical methodology: Not explicitly reported in the paper
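
For reference, AUC and Macro-F1 can be computed from probe scores as in the sketch below. It uses the standard definitions in plain Python; the labels, scores, and the 0.5 decision threshold are made-up illustration values, not results from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    f1_scores = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data: 1 = hallucination, 0 = non-hallucination.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

auc = roc_auc(labels, scores)
f1 = macro_f1(labels, preds)
print(f"AUC = {auc:.4f}, Macro-F1 = {f1:.4f}")
```

AUC depends only on how the scores rank the examples, while Accuracy and Macro-F1 depend on the chosen threshold.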

Key Results

CLAP consistently improves out-of-distribution (OOD) generalization compared to single-layer probes and semantic entropy.

Benchmark	Metric	Baseline	This Paper	Δ
WikiData/NQ/TQA (average across 20 pairs)	Gain in AUC (%)	0.0	6.5	+6.5
WikiData/NQ/TQA (average across 20 pairs)	Gain in AUC (%)	0.0	4.1	+4.1

Mitigation experiments show that combining CLAP with DoLa (+CLAP-II) significantly reduces the rate of wrongfully modifying correct answers (NH->H) compared to using DoLa alone.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 3 tasks	Abstention Rate (%Abs)	51.5	27.0	-24.5
Average across 3 tasks	% Non-Hallucinations (%NH)	49.1	53.8	+4.7

Experiment Figures

Impact of mitigation strategies on response quality. Specifically, the rate of replacing hallucinated answers with non-hallucinated ones (H->NH) versus wrongfully replacing non-hallucinated answers (NH->H).

Main Takeaways

Cross-layer attention provides robust embeddings that generalize better to unseen domains than single-layer activations.
Training on sampled responses (fine-grained supervision) allows CLAP to better distinguish hallucinations even within the same prompt's response space.
Detect-then-mitigate (CLAP-II) is safer than direct mitigation (DoLa), as it preserves originally correct answers that DoLa might accidentally corrupt.
Discriminative information for hallucination is retained even when projecting layers to low dimensions (d=128), making the method efficient.
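
The detect-then-mitigate idea in the takeaways can be sketched as a simple gate: the mitigation step runs only when the probe flags the original response. This is one plausible reading, not the paper's exact CLAP-II procedure. The function names, the 0.5 threshold, and the stub generator, probe, and mitigation functions are all hypothetical placeholders.

```python
def detect_then_mitigate(prompt, generate, probe_score, mitigate, threshold=0.5):
    """Return (final_response, was_mitigated).

    generate(prompt)   -> (response, activations)   hypothetical LLM call
    probe_score(acts)  -> probability of hallucination from a trained probe
    mitigate(prompt)   -> alternative response, e.g. from a DoLa-style decoder
    """
    response, activations = generate(prompt)
    score = probe_score(activations)
    if score >= threshold:
        # Flagged as a likely hallucination: replace the response.
        return mitigate(prompt), True
    # Not flagged: keep the original response untouched.
    return response, False


# Stub components so the sketch runs end to end.
def fake_generate(prompt):
    return "answer to " + prompt, [0.1, 0.2]


def fake_mitigate(prompt):
    return "mitigated answer to " + prompt


def confident_probe(activations):
    return 0.1  # low hallucination probability


def suspicious_probe(activations):
    return 0.9  # high hallucination probability


kept = detect_then_mitigate("q", fake_generate, confident_probe, fake_mitigate)
changed = detect_then_mitigate("q", fake_generate, suspicious_probe, fake_mitigate)
print(kept)
print(changed)
```

Because unflagged responses bypass the mitigation step entirely, a gate like this cannot turn them into hallucinations, which is the NH->H failure mode the takeaway attributes to applying DoLa unconditionally.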

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (residual streams, layers)
Knowledge of LLM decoding strategies (greedy vs. sampling)
Familiarity with supervised classification and probing

Key Terms

residual stream: The sequence of hidden states (activations) passing through the stack of layers in a Transformer model

activation probing: Training a classifier (probe) on the internal activations of a frozen LLM to predict properties like truthfulness

DoLa: Decoding by Contrasting Layers—a mitigation method that contrasts output probabilities of the final layer with intermediate layers

greedy decoding: Generating text by always selecting the highest probability token at each step

semantic entropy: A measure of uncertainty based on the semantic meaning of multiple sampled responses rather than just token probabilities

fine-grained detection: The ability to distinguish between hallucinated and correct responses for the same prompt within the sampled response space

AUC: Area Under the ROC Curve, a threshold-independent classification metric equal to the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one

OOD: Out-of-Distribution—testing the model on data from a different domain than it was trained on

logit: The raw, unnormalized prediction scores generated by the last layer of a neural network before applying softmax