What do Geometric Hallucination Detection Metrics Actually Measure?

📝 Paper Summary

Hallucination suppression, Hallucination detection, Internal state analysis

Geometric metrics derived from LLM internal states effectively detect hallucinations but are highly sensitive to domain shifts; a proposed perturbation-based normalization method restores detection performance across mixed domains.

Core Problem

Existing geometric hallucination detectors (Hidden Score, Attention Score) are sensitive to task domain changes, causing their performance to degrade significantly when applied in multi-domain settings.

Why it matters:

Hallucinations remain a major barrier to deploying generative models in high-consequence applications where external ground truth is unavailable.
Current methods often fail to generalize: a detector tuned for math questions fails on history questions because the statistic's variance across domains exceeds the detection margin.
Understanding which specific characteristics of hallucinations (e.g., irrelevance vs. incoherence) trigger these geometric signals is crucial for building reliable detectors.

Concrete Example: A detector using Hidden Score achieves high accuracy (0.92 AUROC) on math multiplication problems. However, when tested on a mixed dataset including history and counting tasks, the baseline score shifts so much that the detector cannot distinguish a math hallucination from a correct history answer, dropping AUROC to 0.57.

Key Novelty

Perturbation Normalization for Geometric Hallucination Detection

Instead of using raw geometric scores (like log determinants of hidden states), the method compares the score of a response against scores from 'neighboring' perturbed responses.
By calculating how much an answer is an outlier relative to local variations (e.g., slightly different numbers), the method cancels out the domain-specific baseline shifts.
This aligns the score distributions across different topics (math, history), allowing a single threshold to work effectively for multi-domain detection.
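
The comparison against perturbed neighbors can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `perturb_numeric_answer` and `perturbation_normalize`, the integer offsets, and the choice to subtract the neighbor mean (optionally dividing by the neighbor spread) are all assumptions. The raw scores would come from a geometric statistic such as Hidden Score.

```python
from statistics import mean, pstdev

def perturb_numeric_answer(answer, offsets=(-2, -1, 1, 2)):
    """Hypothetical perturbation: nearby integers stand in for 'neighboring' answers."""
    return [answer + off for off in offsets]

def perturbation_normalize(score, neighbor_scores, scale=False):
    """Express a raw score relative to the scores of its perturbed neighbors.

    Subtracting the neighbor mean removes any baseline shared by the whole
    neighborhood (assumed here to be the domain-specific offset).
    """
    centered = score - mean(neighbor_scores)
    if not scale:
        return centered
    spread = pstdev(neighbor_scores)
    # Fall back to the centered score when all neighbors score identically.
    return centered / spread if spread > 0 else centered

# Two made-up responses with the same local deviation but different domain baselines.
math_norm = perturbation_normalize(5.5, [4.0, 4.5, 3.5, 4.0])
history_norm = perturbation_normalize(21.5, [20.0, 20.5, 19.5, 20.0])
print(perturb_numeric_answer(56), math_norm, history_norm)
```

In this toy case the raw scores (5.5 and 21.5) could not share a threshold, while the normalized values coincide because each is measured against its own neighborhood.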

Evaluation Highlights

+34 to +40 point increase in AUROC for multi-domain hallucination detection using the proposed normalization method compared to raw statistics.
Hidden Score and Matrix Entropy achieve 0.96 AUROC on the mixed-domain 'all' dataset after normalization, up from ~0.57.
Identifies that different metrics target different errors: Matrix Entropy uniquely detects incoherence (repetition), while Hidden/Attention Scores fail to detect it (performing worse than random).

Breakthrough Assessment

7/10

Provides a significant practical fix (normalization) for a major failure mode (domain shift) in unsupervised hallucination detection. The analysis of what each metric captures is valuable, though the method relies on synthetic perturbations, which may be harder to generate for non-numeric tasks.

⚙️ Technical Details

Problem Definition

Setting: Unsupervised detection of hallucinations in LLM outputs using internal geometric properties of hidden states.

Inputs: Prompt P and Response R (generated by LLM).

Outputs: A scalar score indicating the likelihood that Response R is a hallucination.

Pipeline Flow

Prompt/Response Generation (Synthetic Dataset)
Feature Extraction (Teacher Forcing)
Metric Calculation (HS, AS, ME)
Perturbation Normalization (Optional)

System Modules

Synthetic Data Generator

Creates prompt-response pairs with controlled hallucination types (incorrectness, irrelevance, etc.) across 3 domains.

Model or implementation: Programmatic Templates

Feature Extractor (Inference)

Runs the LLM on the prompt-response pairs to extract internal states.

Model or implementation: Llama-3.1-8B-Instruct

Metric Calculator (Inference)

Computes geometric statistics on the extracted states.

Model or implementation: Analytical Functions

Normalizer

Mitigates domain shift by normalizing scores against perturbed neighbors.

Model or implementation: Statistical Normalization

Novel Architectural Elements

Perturbation Normalization module: A post-processing step that requires running the model on k variations of the answer to compute a local baseline for the geometric statistic.

Modeling

Base Model: Llama-3.1-8B-Instruct

Compute: Single Nvidia H100 GPU

Comparison to Prior Work

vs. Sriramanan et al.: The paper uses the same base metrics (HS, AS) but adds perturbation normalization to solve the domain generalization failure case.
vs. Probes: This method is unsupervised (zero-resource) and does not require training a classifier on labeled hallucination data.
vs. Semantic Entropy: This method uses internal geometric properties of representations rather than output logits/probabilities.

Limitations

The normalization method requires scoring k perturbed variations of each response, so inference grows to roughly k + 1 forward passes per response (at least triple the cost for k ≥ 2).
Perturbation strategies in the paper are domain-specific (adding integers) and may be harder to design for open-ended text generation.
Experiments are limited to short, structured QA tasks (Math, History dates, Counting) rather than long-form text generation.
Evaluation relies on teacher forcing, which assumes the response can be fed back through the model with access to its internal states.

📊 Experiments & Results

Evaluation Setup

Binary classification of LLM outputs as 'Hallucination' or 'Correct' based on geometric scores.

Benchmarks:

Math (Integer multiplication) [New]
History (Date retrieval: year of event) [New]
Counting (Word count in sequence) [New]

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance of raw geometric statistics on detecting factual incorrectness (Level 3 severity). Shows strong single-domain performance but collapse on mixed domains.

Benchmark	Metric	Baseline	This Paper	Δ
Math	AUROC	0.50	0.92	+0.42
History	AUROC	0.50	0.75	+0.25
All (Mixed Domain)	AUROC	0.50	0.57	+0.07

Impact of Perturbation Normalization on mixed-domain detection (Level 1 Incorrectness).

Benchmark	Metric	Baseline	This Paper	Δ
All (Mixed Domain)	AUROC	0.56	0.96	+0.40
All (Mixed Domain)	AUROC	0.55	0.89	+0.34

Sensitivity to specific hallucination types (Level 3 severity on 'All' dataset).

Benchmark	Metric	Baseline	This Paper	Δ
All (Mixed Domain)	AUROC	0.50	0.96	+0.46
All (Mixed Domain)	AUROC	0.50	0.99	+0.49

Main Takeaways

Different geometric statistics capture different hallucination types: Matrix Entropy is uniquely sensitive to Incoherence (repetition), while Hidden/Attention scores are better for Irrelevance.
Domain shift is a critical failure mode for geometric detectors; the variance in scores between domains (Math vs History) is larger than the variance between correct/incorrect answers.
Perturbation Normalization effectively cancels out domain baselines, allowing a single detector to work across Math, History, and Counting tasks with high accuracy.
Optimal layers for detection vary by domain (Layers 30-31 for Math, Layers 14-16 for History), but normalization aligns performance at later layers.
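
The domain-shift takeaway can be reproduced with a toy simulation. The sketch below uses made-up Gaussian scores, not the paper's data: the baselines (0 and 10), the class gap (3), and the assumption that perturbed neighbors behave like wrong answers from the same domain are all illustrative choices.

```python
import random

def auroc(pos, neg):
    """Fraction of (hallucinated, correct) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def simulate_domain(rng, baseline, n=200, gap=3.0, k=4):
    """Toy scores: N(baseline, 1) for correct answers, N(baseline + gap, 1) for hallucinations.

    Assumption: the k perturbed neighbors are wrong answers from the same
    domain, so their scores follow the hallucinated distribution.
    """
    def neighbor_mean():
        return sum(rng.gauss(baseline + gap, 1.0) for _ in range(k)) / k

    correct = [rng.gauss(baseline, 1.0) for _ in range(n)]
    halluc = [rng.gauss(baseline + gap, 1.0) for _ in range(n)]
    correct_norm = [s - neighbor_mean() for s in correct]
    halluc_norm = [s - neighbor_mean() for s in halluc]
    return correct, halluc, correct_norm, halluc_norm

rng = random.Random(0)
math_scores = simulate_domain(rng, baseline=0.0)
history_scores = simulate_domain(rng, baseline=10.0)

auc_math = auroc(math_scores[1], math_scores[0])
auc_mixed_raw = auroc(math_scores[1] + history_scores[1],
                      math_scores[0] + history_scores[0])
auc_mixed_norm = auroc(math_scores[3] + history_scores[3],
                       math_scores[2] + history_scores[2])
print(f"math only: {auc_math:.2f}, mixed raw: {auc_mixed_raw:.2f}, "
      f"mixed normalized: {auc_mixed_norm:.2f}")
```

Because the gap between domain baselines is larger than the gap between classes, pooling raw scores misorders most cross-domain pairs; subtracting the local neighbor mean removes the baseline and restores a shared scale.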

📚 Prerequisite Knowledge

Prerequisites

Linear Algebra (Gram matrices, eigenvalues, determinants)
Information Theory (Entropy)
Transformer Architecture (Residual stream, Attention heads)
Teacher Forcing

Key Terms

Hidden Score (HS): The sequence-normalized log determinant of the Gram matrix formed by hidden states; measures the volume/diffusiveness of the representation space.
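
As a rough illustration of this definition, the sketch below computes a Hidden Score with NumPy. It assumes the hidden states of the response tokens at one layer are available as an (m, d) array; the function name `hidden_score` and the small ridge term `eps` are illustrative choices, and the paper's exact preprocessing (for example centering or layer selection) may differ.

```python
import numpy as np

def hidden_score(hidden_states, eps=1e-6):
    """Sequence-normalized log determinant of the token Gram matrix.

    hidden_states: array of shape (m, d), one row per response token.
    eps: ridge term (an assumption) that keeps the determinant finite
         when token representations are nearly collinear.
    """
    H = np.asarray(hidden_states, dtype=np.float64)
    m = H.shape[0]
    gram = H @ H.T + eps * np.eye(m)          # (m, m) pairwise inner products
    _, logdet = np.linalg.slogdet(gram)       # numerically stable log|det|
    return logdet / m

rng = np.random.default_rng(0)
diffuse = rng.normal(size=(5, 16))                             # spread-out token vectors
collapsed = np.ones((5, 16)) + 1e-3 * rng.normal(size=(5, 16)) # nearly identical vectors
hs_diffuse = hidden_score(diffuse)
hs_collapsed = hidden_score(collapsed)
print(hs_diffuse, hs_collapsed)
```

Token vectors that span a larger volume give a larger log determinant, so the spread-out example scores higher than the nearly collinear one.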

Attention Score (AS): The sequence-normalized log determinant of the attention map; measures how much tokens attend to themselves.
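
The link to self-attention follows from causal masking: a causal attention map is lower-triangular, so its determinant is the product of its diagonal entries, which are the self-attention weights. The sketch below illustrates this with NumPy; the function name `attention_score`, the (heads, m, m) input layout, the averaging over heads, and the `eps` guard are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def attention_score(attn, eps=1e-12):
    """Sequence-normalized log determinant of causal attention maps.

    attn: array of shape (heads, m, m); each map is lower-triangular
          with rows summing to 1.
    eps: guard (an assumption) against log(0) on the diagonal.
    """
    A = np.asarray(attn, dtype=np.float64)
    m = A.shape[-1]
    diag = np.diagonal(A, axis1=-2, axis2=-1)       # self-attention weights, (heads, m)
    per_head = np.log(diag + eps).sum(axis=-1) / m  # log det of a triangular matrix
    return float(per_head.mean())

# Toy causal map: token i attends uniformly to tokens 0..i.
m = 3
uniform = np.tril(np.ones((m, m)))
uniform = uniform / uniform.sum(axis=1, keepdims=True)
score = attention_score(uniform[None])
reference = np.linalg.slogdet(uniform)[1] / m       # same quantity via a general routine
print(score, reference)
```

Reading the diagonal avoids a full determinant computation, which matters when the statistic is evaluated for every head and layer.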

Matrix Entropy (ME): A measure of information content in the hidden state Gram matrix, specifically using Von Neumann or Shannon entropy formulations.
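
One way to make this concrete is the von Neumann form: normalize the Gram matrix to unit trace and take the entropy of its eigenvalues. The sketch below is an assumption about the formulation (the summary also mentions a Shannon variant), and the function name `matrix_entropy` and the eigenvalue cutoff `tol` are illustrative.

```python
import numpy as np

def matrix_entropy(hidden_states, tol=1e-12):
    """Von Neumann entropy of the trace-normalized token Gram matrix.

    hidden_states: array of shape (m, d), one row per response token.
    tol: eigenvalues below this are treated as zero (0 * log 0 = 0).
    """
    H = np.asarray(hidden_states, dtype=np.float64)
    gram = H @ H.T
    rho = gram / np.trace(gram)            # eigenvalues now sum to 1
    eig = np.linalg.eigvalsh(rho)          # symmetric matrix, real eigenvalues
    eig = eig[eig > tol]
    return float(-(eig * np.log(eig)).sum())

distinct = np.eye(4)                        # four orthogonal token vectors
repeated = np.tile([1.0, 2.0, 3.0, 4.0], (4, 1))  # the same vector four times
print(matrix_entropy(distinct), matrix_entropy(repeated))
```

As a toy contrast only, orthogonal token vectors reach the maximum entropy log(m), while a sequence that repeats one vector collapses to rank one and gives an entropy of zero.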

Teacher Forcing: A training/evaluation technique where the model is fed the ground-truth (or specific target) history at each step rather than its own previous generations.

AUROC: Area Under the Receiver Operating Characteristic curve; a metric for binary classification performance independent of the decision threshold.
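
AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which gives a direct way to compute it on small samples. In the sketch below, the convention that higher scores indicate hallucination and the example values are assumptions for illustration.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

hallucinated = [0.9, 0.8, 0.4]   # made-up detector scores for hallucinated responses
correct = [0.7, 0.3, 0.2]        # made-up detector scores for correct responses
value = auroc(hallucinated, correct)
print(value)                     # 8 of the 9 pairs are ordered correctly
```

A value of 0.5 corresponds to chance-level ranking, and values below 0.5 mean the score ranks the two classes in the wrong order more often than not.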

Gram Matrix: The matrix of pairwise inner products between token representations, computed by multiplying the matrix of representations by its transpose; captures the correlations between different token representations.

Perturbation Normalization: Proposed method: normalizing a statistic by comparing it to the average statistic of k artificially perturbed versions of the response.