
What do Geometric Hallucination Detection Metrics Actually Measure?

Eric Yeats, John Buckheit, Sarah Scullen, Brendan Kennedy, Loc Truong, Davis Brown, Bill Kay, Cliff Joslyn, Tegan Emerson, Michael J. Henry, John Emanuello, Henry Kvinge
Not explicitly listed in paper text
arXiv (2026)
Factuality Benchmark QA

📝 Paper Summary

Hallucination suppression · Hallucination detection · Internal state analysis
Geometric metrics derived from LLM internal states effectively detect hallucinations but are highly sensitive to domain shifts; a proposed perturbation-based normalization method restores detection performance across mixed domains.
Core Problem
Existing geometric hallucination detectors (Hidden Score, Attention Score) are sensitive to task domain changes, causing their performance to degrade significantly when applied in multi-domain settings.
Why it matters:
  • Hallucinations remain a major barrier to deploying generative models in high-consequence applications where external ground truth is unavailable.
  • Current methods often fail to generalize: a detector tuned for math questions fails on history questions because the statistic's variance across domains exceeds the detection margin.
  • Understanding which specific characteristics of hallucinations (e.g., irrelevance vs. incoherence) trigger these geometric signals is crucial for building reliable detectors.
Concrete Example: A detector using Hidden Score reaches 0.92 AUROC on math multiplication problems. On a mixed dataset that also includes history and counting tasks, the baseline score shifts so much between domains that the detector cannot distinguish a math hallucination from a correct history answer, and AUROC drops to 0.57.
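The pooled-threshold failure can be illustrated with a toy calculation. The sketch below uses made-up scores (not data from the paper) and a hypothetical `auroc` helper. It only shows how a baseline shift between domains lowers pooled AUROC even when each domain is perfectly separable on its own.

```python
def auroc(pos, neg):
    """Probability that a random positive (hallucination) outscores a
    random negative (correct answer); ties count as half a win."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical raw detector scores, chosen only for illustration.
# Within each domain, hallucinations score higher than correct answers,
# but the history baseline sits well above the math baseline.
math_scores = {"halluc": [5.0, 5.5, 6.0], "correct": [3.0, 3.5, 4.0]}
history_scores = {"halluc": [9.0, 9.5, 10.0], "correct": [7.0, 7.5, 8.0]}

auroc_math = auroc(math_scores["halluc"], math_scores["correct"])
auroc_history = auroc(history_scores["halluc"], history_scores["correct"])

# Pool both domains and score them against one shared ranking.
pooled_halluc = math_scores["halluc"] + history_scores["halluc"]
pooled_correct = math_scores["correct"] + history_scores["correct"]
auroc_pooled = auroc(pooled_halluc, pooled_correct)

print(auroc_math, auroc_history, auroc_pooled)  # 1.0 1.0 0.75
```

In this toy setup every correct history answer outscores every math hallucination, so the pooled ranking mixes the two classes even though each domain is cleanly separated in isolation.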
Key Novelty
Perturbation Normalization for Geometric Hallucination Detection
  • Instead of using raw geometric scores (like log determinants of hidden states), the method compares the score of a response against scores from 'neighboring' perturbed responses.
  • By calculating how much an answer is an outlier relative to local variations (e.g., slightly different numbers), the method cancels out the domain-specific baseline shifts.
  • This aligns the score distributions across different topics (math, history), allowing a single threshold to work effectively for multi-domain detection.
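One way to picture the normalization described above is as a local z-score: the raw score of a response is centered and scaled by the scores of its perturbed neighbors. The summary does not give the paper's exact formula, so the z-score form, the function name `perturbation_normalized`, and all numbers below are assumptions made for illustration.

```python
from statistics import mean, pstdev


def perturbation_normalized(score, neighbor_scores, eps=1e-8):
    """Express `score` as a deviation from the scores of perturbed
    neighbor responses. The z-score form is an assumed stand-in for
    the paper's normalization, not its confirmed definition."""
    mu = mean(neighbor_scores)
    sigma = pstdev(neighbor_scores)
    return (score - mu) / (sigma + eps)


# Hypothetical raw scores: the history domain sits 4 units above math.
math_raw, math_neighbors = 5.5, [3.0, 3.5, 4.0]
history_raw, history_neighbors = 9.5, [7.0, 7.5, 8.0]

math_norm = perturbation_normalized(math_raw, math_neighbors)
history_norm = perturbation_normalized(history_raw, history_neighbors)

# The raw scores differ by the domain offset; the normalized ones agree.
print(history_raw - math_raw)
print(abs(history_norm - math_norm) < 1e-9)
```

Because the neighbors come from the same prompt and domain as the response, any additive domain offset appears in both the score and the neighbor mean and cancels in the subtraction. That cancellation is what lets a single threshold be shared across domains.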
Evaluation Highlights
  • The proposed normalization raises AUROC for multi-domain hallucination detection by 34 to 40 points compared to raw statistics.
  • Hidden Score and Matrix Entropy achieve 0.96 AUROC on the mixed-domain 'all' dataset after normalization, up from ~0.57.
  • Shows that different metrics target different errors: Matrix Entropy alone detects incoherence (repetition), while Hidden Score and Attention Score fail to detect it, performing worse than random.
Breakthrough Assessment
7/10
Provides a significant practical fix (normalization) for a major failure mode (domain shift) in unsupervised hallucination detection. The analysis of what each metric captures is valuable, though the method relies on synthetic perturbations, which may be harder to generate for non-numeric tasks.