Evaluation Setup
Binary classification of LLM outputs as 'Hallucination' or 'Correct' based on geometric scores.
Benchmarks:
- Math (Integer multiplication) [New]
- History (date retrieval: year of event) [New]
- Counting (Word count in sequence) [New]
Metrics:
- AUROC (Area Under Receiver Operating Characteristic)
- Statistical methodology: Not explicitly reported in the paper
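
As a minimal sketch of the metric, AUROC can be computed as the fraction of (hallucinated, correct) pairs in which the hallucinated output receives the higher score, with ties counted as half. The function name `auroc`, the convention that a higher geometric score means "more likely a hallucination", and the example scores are assumptions made for illustration; they are not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC via the pairwise (Mann-Whitney) formulation.

    scores: detector scores; higher is assumed to mean "more likely hallucinated".
    labels: 1 for 'Hallucination', 0 for 'Correct'.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # hallucination correctly ranked above correct answer
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up scores: 8 of the 9 (hallucinated, correct) pairs are ranked correctly.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))  # 8/9
```

Under this definition, a detector whose scores carry no information about the label scores 0.50, which is the chance-level value that appears in most Baseline cells of the results table.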
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance of raw geometric statistics on detecting factual incorrectness (Level 3 severity). Shows strong single-domain performance but collapse on mixed domains. | | | | |
| Math | AUROC | 0.50 | 0.92 | +0.42 |
| History | AUROC | 0.50 | 0.75 | +0.25 |
| All (Mixed Domain) | AUROC | 0.50 | 0.57 | +0.07 |
| Impact of Perturbation Normalization on mixed-domain detection (Level 1 Incorrectness). | | | | |
| All (Mixed Domain) | AUROC | 0.56 | 0.96 | +0.40 |
| All (Mixed Domain) | AUROC | 0.55 | 0.89 | +0.34 |
| Sensitivity to specific hallucination types (Level 3 severity on 'All' dataset). | | | | |
| All (Mixed Domain) | AUROC | 0.50 | 0.96 | +0.46 |
| All (Mixed Domain) | AUROC | 0.50 | 0.99 | +0.49 |
Main Takeaways
- Different geometric statistics capture different hallucination types: Matrix Entropy is uniquely sensitive to Incoherence (repetition), while Hidden/Attention scores are better for Irrelevance.
- Domain shift is a critical failure mode for geometric detectors; the variance in scores between domains (Math vs History) is larger than the variance between correct/incorrect answers.
- Perturbation Normalization effectively cancels out domain baselines, allowing a single detector to work across Math, History, and Counting tasks, with mixed-domain AUROC as high as 0.96 in the reported results.
- Optimal layers for detection vary by domain (layers 30-31 for Math, layers 14-16 for History), but normalization aligns performance at later layers.
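
The domain-shift and normalization takeaways can be illustrated with a small synthetic simulation. This is only a sketch under stated assumptions: this summary does not define Perturbation Normalization, so the code assumes it amounts to subtracting a reference score measured on a perturbed version of the same input, which shares the domain baseline. The offsets, signal size, noise level, and all names (`DOMAIN_OFFSET`, `SIGNAL`, `NOISE`) are hypothetical, and the data is randomly generated rather than taken from the paper.

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC; labels are 1 (hallucination) or 0 (correct)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)

# Hypothetical per-domain score baselines, chosen only for illustration.
DOMAIN_OFFSET = {"math": 0.0, "history": 5.0, "counting": 10.0}
SIGNAL = 1.0  # assumed score shift caused by a hallucination
NOISE = 0.3   # assumed per-measurement noise (standard deviation)

rows = []  # (domain, raw_score, reference_score, label)
for domain, offset in DOMAIN_OFFSET.items():
    for i in range(200):
        label = i % 2  # alternate correct / hallucinated samples
        raw = offset + SIGNAL * label + rng.gauss(0.0, NOISE)
        # Assumed reference: the same statistic measured on a perturbed
        # version of the input, which shares the domain baseline.
        reference = offset + rng.gauss(0.0, NOISE)
        rows.append((domain, raw, reference, label))

# Raw scores evaluated within each domain separately.
per_domain = {
    d: auroc([r[1] for r in rows if r[0] == d],
             [r[3] for r in rows if r[0] == d])
    for d in DOMAIN_OFFSET
}

labels = [r[3] for r in rows]
pooled_raw = auroc([r[1] for r in rows], labels)          # domains mixed, raw
pooled_norm = auroc([r[1] - r[2] for r in rows], labels)  # domains mixed, normalized

print(per_domain, pooled_raw, pooled_norm)
```

In this toy setup the raw score separates the classes well inside each domain, but pooling the domains lowers the AUROC because the gap between domain baselines is larger than the gap between correct and hallucinated outputs. Subtracting the reference removes the shared baseline and recovers most of the separation, at the cost of adding the reference measurement's own noise.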