Evaluation Setup
Hallucination detection across diverse tasks
Benchmarks:
- Natural Questions (Instruction following / QA)
- MATH-500 (Math reasoning)
- SQuAD (Reading comprehension / QA)
- 7 other diverse benchmarks (Various)
Metrics:
- AUROC (Area Under the Receiver Operating Characteristic curve)
- Statistical methodology: Not explicitly reported in the paper
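For readers unfamiliar with the metric, the following is a minimal sketch of how AUROC can be computed for a hallucination detector. It uses the pairwise (Mann-Whitney) formulation: the probability that a randomly chosen hallucinated response receives a higher risk score than a randomly chosen faithful one. The function name `auroc` and the toy labels and scores are illustrative assumptions, not values or code from the paper.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison.

    labels: 1 = hallucinated, 0 = faithful (hypothetical toy encoding).
    scores: detector risk scores, higher meaning "more likely hallucinated".
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration only).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8]
result = auroc(labels, scores)
```

On this toy data, five of the six positive/negative pairs are ranked correctly, so `result` is 5/6. The quadratic loop is only meant to make the definition explicit; a rank-based implementation is preferable for large evaluation sets.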
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| SQuAD | Correlation (Pearson/Spearman implied) | Lower (implied) | 0.84 | Not reported |
| MATH-500 | Correlation (Pearson/Spearman implied) | Lower (implied) | 0.88 | Not reported |
Main Takeaways
- Hallucinations rarely have a single cause: they are usually mixtures of data-driven bias and reasoning-driven instability, and the dominant factor varies by task (e.g., MATH-500 is 98.1% reasoning errors; Natural Questions is 88.9% reasoning).
- The unified Hallucination Risk Bound effectively decomposes risk: det(K) works best for factual errors, while spectral norms work best for reasoning slips.
- HalluGuard consistently achieves state-of-the-art performance across 10 benchmarks and 9 models, validating the theoretical framework.
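The two quantities named in the takeaways, det(K) and a spectral norm, can be illustrated with a small sketch. This summary does not define K or say which matrices the spectral norm is applied to, so the sketch assumes K is a symmetric positive-definite kernel (Gram) matrix and A is an arbitrary real matrix. The helper names `log_det_spd` and `spectral_norm` and the toy matrices are hypothetical; this is not the paper's implementation.

```python
import math


def log_det_spd(K):
    """log(det(K)) for a symmetric positive-definite matrix, via Cholesky.

    Working in log space avoids overflow/underflow of the raw determinant.
    """
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = K[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
                log_det += 2.0 * math.log(L[i][j])
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return log_det


def spectral_norm(A, iters=200):
    """Largest singular value of A, via power iteration on A^T A.

    Assumes the uniform start vector is not orthogonal to the top
    right-singular vector (true for the toy example below).
    """
    rows, cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in Av))


# Toy matrices (made up for illustration only).
K = [[2.0, 0.5], [0.5, 1.0]]   # det(K) = 2*1 - 0.5*0.5 = 1.75
A = [[3.0, 0.0], [0.0, 1.0]]   # singular values 3 and 1

log_det_K = log_det_spd(K)
sigma_max = spectral_norm(A)
```

In practice, `numpy.linalg.slogdet` and `numpy.linalg.norm(A, 2)` compute the same quantities; the pure-Python version is used here only to keep the sketch dependency-free.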