Hallucination Detection in LLMs Using Spectral Features of Attention Maps

📝 Paper Summary

Hallucination suppression, Internal state analysis

LapEigvals detects hallucinations by treating LLM attention maps as graph adjacency matrices and using the top eigenvalues of their Laplacian as input features for a detection probe.

Core Problem

LLMs frequently generate hallucinations (nonsensical or unfaithful content), which is especially risky in safety-critical applications, and existing detection methods often fail to capture the subtle internal signals that indicate these errors.

Why it matters:

Eliminating hallucinations entirely is currently impossible, making reliable post-hoc detection essential for safe deployment
Previous attention-based detection methods (like AttentionScore) often lack robustness or statistical separability between hallucinated and correct answers
Understanding internal model states during hallucinations can offer insights into the mechanisms of model failure

Concrete Example: In TriviaQA, an LLM might answer a question incorrectly. Previous methods like AttentionScore (sum of log-determinants) often yield overlapping distributions for correct vs. incorrect answers (high p-values). LapEigvals shows significantly lower p-values, indicating distinct spectral signatures when the model hallucinates.

Key Novelty

LapEigvals (Laplacian Eigenvalues for Hallucination Detection)

Interprets the attention mechanism as a directed graph where tokens are nodes and attention scores are edge weights, representing information flow
Constructs a Laplacian matrix from these attention maps to capture structural properties like information bottlenecks, which are hypothesized to correlate with hallucinations
Uses the top-k eigenvalues of this Laplacian matrix as a compact, informative feature vector for training a simple logistic regression probe
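
The three steps above can be sketched for a single attention map as follows. This is a minimal illustration, not the authors' implementation: the function name laplacian_topk_eigvals and the toy matrix are hypothetical, the Laplacian form L = I - D^{-1}A is the one stated in this summary, and the choice of column sums as the out-degree is an assumption about edge direction (information flowing from the attended token to the attending token).

```python
import numpy as np


def laplacian_topk_eigvals(attn, k):
    """Return the k largest eigenvalues of L = I - D^{-1} A for one attention map.

    attn[i, j] is the weight that token i (query) places on token j (key).
    """
    attn = np.asarray(attn, dtype=float)
    n_tokens = attn.shape[0]
    # Assumption: information flows from the attended token j to the attending
    # token i, so the out-degree of node j is the j-th column sum.
    out_degree = attn.sum(axis=0)
    # D^{-1} A divides row i of A by d_i.
    laplacian = np.eye(n_tokens) - attn / out_degree[:, None]
    # A causal (lower-triangular) map has real eigenvalues; .real drops any
    # numerical imaginary residue left by the general-purpose solver.
    eigvals = np.linalg.eigvals(laplacian).real
    # Sort in descending order and keep the k largest values.
    return np.sort(eigvals)[::-1][:k]


# Toy 3-token causal attention map (each row sums to 1).
toy_attn = np.array([
    [1.0, 0.0, 0.0],
    [0.4, 0.6, 0.0],
    [0.2, 0.3, 0.5],
])
top2 = laplacian_topk_eigvals(toy_attn, k=2)
```

Because only the k largest eigenvalues are kept, the feature vector has a fixed length regardless of how many tokens the answer contains, as long as the sequence has at least k tokens.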

Architecture

The pipeline for converting attention maps into Laplacian eigenvalues and training a probe. Figure 2 visualizes the graph interpretation of attention.

Evaluation Highlights

Achieves state-of-the-art hallucination detection performance on 6 out of 7 QA datasets (including TriviaQA, CoQA, SQuADv2) across 5 LLM families
Outperforms the AttentionScore baseline and raw attention eigenvalues (AttnEigvals), with LapEigvals reaching ~0.82 AUROC on TriviaQA (Mistral-Small-24B) vs. ~0.77 for the best baseline
Demonstrates robust generalization: training on one dataset (e.g., TriviaQA) and testing on another (e.g., NQ-Open) yields minimal performance drops compared to baselines

Breakthrough Assessment

7/10

Strong empirical results surpassing existing attention-based methods. The graph-theoretic interpretation of attention for hallucination detection is a novel and effective perspective, though it relies on a standard probing setup.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of generated answers as 'Hallucination' or 'Non-Hallucination' based on internal model features.

Inputs: Attention maps A from all layers and heads generated during inference for a prompt

Outputs: Probability that the generated answer is a hallucination

Pipeline Flow

LLM Inference (Generate answer and extract Attention Maps)
Laplacian Construction (Compute L = I - D^{-1}A)
Spectral Feature Extraction (Compute top-k Eigenvalues)
Dimensionality Reduction (PCA)
Hallucination Probe (Logistic Regression)
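
The five pipeline steps can be sketched end to end on synthetic data. Everything here is a stand-in under stated assumptions: random_causal_attention replaces real LLM inference, the labels are random placeholders (so the probe learns nothing meaningful), the out-degree is taken as the column sum, and the PCA size of 8 is chosen only because the toy feature vector is small. All function and variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def random_causal_attention(rng, n_layers, n_heads, n_tokens):
    """Stand-in for step 1: random causal maps whose rows sum to 1."""
    scores = np.tril(rng.random((n_layers, n_heads, n_tokens, n_tokens)))
    return scores / scores.sum(axis=-1, keepdims=True)


def spectral_features(attn, k):
    """Steps 2-3: top-k eigenvalues of L = I - D^{-1} A per layer and head."""
    n_tokens = attn.shape[-1]
    # Assumption: the out-degree of token j is its column sum.
    out_degree = attn.sum(axis=-2)                        # (layers, heads, T)
    laplacian = np.eye(n_tokens) - attn / out_degree[..., :, None]
    eigvals = np.linalg.eigvals(laplacian).real           # batched over layers and heads
    topk = -np.sort(-eigvals, axis=-1)[..., :k]           # descending, keep k largest
    return topk.reshape(-1)                               # concatenate into one vector


rng = np.random.default_rng(0)
n_answers, n_layers, n_heads, n_tokens, k = 120, 2, 4, 12, 5

# One feature vector per generated answer: layers * heads * k values.
X = np.stack([
    spectral_features(random_causal_attention(rng, n_layers, n_heads, n_tokens), k)
    for _ in range(n_answers)
])
y = rng.integers(0, 2, size=n_answers)  # placeholder hallucination labels

probe = make_pipeline(
    PCA(n_components=8),                                          # step 4
    LogisticRegression(max_iter=2000, class_weight="balanced"),   # step 5
)
probe.fit(X[:90], y[:90])
p_hallucination = probe.predict_proba(X[90:])[:, 1]  # P(hallucination) per held-out answer
```

With real data, the synthetic generator would be replaced by attention maps captured during generation and the labels by the judged correctness of each answer.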

System Modules

LLM Inference

Generate text and expose internal attention weights

Model or implementation: Llama-3.1-8B, Llama-3.2-3B, Phi-3.5, Mistral-Nemo, Mistral-Small-24B

Feature Extractor

Transform raw attention maps into spectral features

Model or implementation: Laplacian Eigenvalue Decomposition

Hallucination Probe

Classify the feature vector as Hallucination or Not

Model or implementation: Logistic Regression (scikit-learn)

Novel Architectural Elements

Usage of Laplacian Eigenvalues of attention maps specifically as input features for hallucination probes
Definition of a specific directed Laplacian for attention maps: L = I - D^{-1}A where D is the out-degree matrix
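
One consequence of this definition is worth spelling out. Assuming the attention maps are causally masked (lower triangular, as in decoder-only models), and writing d_i for the i-th diagonal entry of D and T for the number of tokens, the Laplacian is itself lower triangular, so its eigenvalues are its diagonal entries. Whether the authors' code exploits this shortcut is not stated in this summary.

```latex
L = I - D^{-1}A, \qquad
L_{ij} = \delta_{ij} - \frac{a_{ij}}{d_{i}}, \qquad
a_{ij} = 0 \ \text{for } j > i
\;\Longrightarrow\;
\det(L - \lambda I) = \prod_{i=1}^{T}\left(1 - \frac{a_{ii}}{d_{i}} - \lambda\right),
\qquad
\lambda_i = 1 - \frac{a_{ii}}{d_{i}}, \quad i = 1, \dots, T .
```

Under this assumption, each eigenvalue depends only on a token's self-attention weight relative to its degree, and the top-k selection reduces to sorting the diagonal of L.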

Modeling

Base Model: Evaluated on 5 models: Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Phi-3.5-mini-instruct, Mistral-Nemo-Instruct-2407, Mistral-Small-24B-Instruct-2501

Training Method: Supervised training of a logistic regression probe (the LLM itself is frozen)

Objective Functions:

Purpose: Minimize classification error of the probe.

Formally: Standard Logistic Regression loss (Cross-Entropy)

Training Data:

7 QA datasets: NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEvalQA, TruthfulQA, GSM8K
Labels generated via LLM-as-a-judge (GPT-4o-mini) or exact match (GSM8K)

Key Hyperparameters:

probe_max_iter: 2000
probe_class_weight: balanced
pca_dimensions: 512
top_k_eigenvalues: Selected from {5, 10, 20, 50, 100}
decoding_temperature: 0.1 and 1.0

Compute: Not reported in the paper

Comparison to Prior Work

vs. AttentionScore: LapEigvals is supervised and uses spectral graph features rather than determinant-based heuristics
vs. AttnEigvals: Uses Laplacian matrix eigenvalues instead of raw attention matrix eigenvalues, showing that the Laplacian transformation is crucial for performance
vs. Hidden States (not a main baseline in the paper, but compared in the appendix): LapEigvals outperforms hidden-state probes in most settings, suggesting attention structure holds unique signals

Limitations

Requires access to internal attention maps, which is not possible for API-only black-box models
Computational cost of eigendecomposition for every attention head at every layer is higher than simple scalar aggregations
Performance degrades on datasets with severe class imbalance or small sizes (e.g., TruthfulQA)
Does not generalize well to distinct domains like math problems (GSM8K)

Reproducibility

Code: https://github.com/graphml-lab-pwr/lapeigvals

The code is publicly available. The datasets are standard public benchmarks, and the LLMs are open-weights models.

📊 Experiments & Results

Evaluation Setup

Post-hoc binary classification of generated answers on QA tasks

Benchmarks:

TriviaQA (Open-domain QA)
NQ-Open (Open-domain QA)
CoQA (Conversational QA)
SQuADv2 (Reading Comprehension)
HaluEvalQA (Hallucination Evaluation)
TruthfulQA (Truthfulness)
GSM8K (Grade School Math)

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
Statistical methodology: Two-sided Mann-Whitney U test reported for feature significance analysis
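
The significance analysis named above can be illustrated with SciPy's mannwhitneyu. This is a sketch on synthetic numbers, not the paper's data: the three samples are hypothetical values of one spectral feature, with the hallucinated group deliberately shifted so the test has something to detect.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical values of a single feature for each group of answers.
feature_correct = rng.normal(loc=0.0, scale=1.0, size=100)
feature_hallucinated = rng.normal(loc=2.0, scale=1.0, size=100)  # shifted on purpose
feature_unrelated = rng.normal(loc=0.0, scale=1.0, size=100)     # same distribution as correct

# Two-sided test: is either group stochastically larger than the other?
p_shifted = mannwhitneyu(
    feature_correct, feature_hallucinated, alternative="two-sided"
).pvalue
p_unrelated = mannwhitneyu(
    feature_correct, feature_unrelated, alternative="two-sided"
).pvalue

print(f"shifted feature:   p = {p_shifted:.3g}")
print(f"unrelated feature: p = {p_unrelated:.3g}")
```

A low p-value means the feature's distribution differs between correct and hallucinated answers, which is the sense in which a feature is called separable in this summary.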

Key Results

Comparison	Benchmark	Metric	Baseline	This Paper	Δ
Main comparison shows LapEigvals outperforming baselines on TriviaQA across different models	TriviaQA	AUROC	0.77	0.82	+0.05
Main comparison (continued)	TriviaQA	AUROC	0.75	0.82	+0.07
Ablation study on input size (k eigenvalues) shows LapEigvals is more efficient	TriviaQA	AUROC	0.77	0.79	+0.02
Robustness across prompts analysis	TriviaQA	Standard Deviation (AUROC)	0.07	0.05	-0.02

Experiment Figures

Comparison of p-values from Mann-Whitney U test for Laplacian Eigenvalues vs AttentionScore across layers.

Performance (AUROC) vs. Number of Eigenvalues (k).

Main Takeaways

LapEigvals consistently outperforms raw attention eigenvalues (AttnEigvals) and log-determinant features (AttnLogDet) across 6/7 datasets.
The method is robust to decoding temperature, maintaining superiority at both temp=0.1 and temp=1.0.
Information indicating hallucinations is distributed across layers; combining features from 'all layers' consistently beats the best 'single layer' probe.
LapEigvals generalizes better than baselines when trained on one dataset and tested on another, showing smaller performance drops.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanism)
Graph Theory (Adjacency matrix, Laplacian matrix, Eigenvalues)
Logistic Regression
Principal Component Analysis (PCA)

Key Terms

Attention Map: A matrix representing how much focus each token puts on every other token in a Transformer model

Laplacian Matrix: A matrix representation of a graph that captures structural properties like connectivity and flow. The standard combinatorial form is L = D - A; the variant stated in this summary for directed attention graphs is the normalized form L = I - D^{-1}A

Eigenvalues: Scalar values associated with a linear transformation (matrix) that characterize its fundamental properties; in graphs, they describe connectivity and partitioning

Probing: Training a simple classifier (probe) on internal representations of a pre-trained model to predict specific properties (here, truthfulness)

AUROC: Area Under the Receiver Operating Characteristic Curve—a metric for binary classification performance, where 0.5 is random guessing and 1.0 is perfect

PCA: Principal Component Analysis—a technique to reduce the dimensionality of data while preserving as much variance (information) as possible

Log-determinant: The natural logarithm of the determinant of a matrix, used in prior work (AttentionScore) as a summary statistic for attention maps

Out-degree matrix: A diagonal matrix where each entry represents the sum of outgoing edge weights for a node; used here to normalize attention flow
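
The AUROC entry above can be made concrete with a small pairwise computation: AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The helper name auroc and the toy scores are hypothetical, and the quadratic pairwise form is chosen for readability rather than speed.

```python
import numpy as np


def auroc(scores, labels):
    """Pairwise AUROC: P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (pos.size * neg.size)


# Every hallucinated answer (label 1) scores above every correct one.
perfect = auroc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])

# Identical scores carry no information about the label.
uninformative = auroc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
```

Here perfect evaluates to 1.0 and uninformative to 0.5, matching the two reference points in the definition.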