Evaluation Setup
Cross-domain hallucination detection: train on one dataset (e.g., CoQA), then test on a mixed pool of the other 5.
Benchmarks:
- TriviaQA (Knowledge-intensive QA)
- CommonsenseQA (Commonsense reasoning)
- Belebele (Reading comprehension)
- CoQA (Conversational QA)
- Math (Mathematical reasoning)
- SVAMP (Math word problems)
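The train-on-one, test-on-the-rest protocol can be sketched as below. This is a minimal illustration, not the paper's code: the function name `cross_domain_split` and the `data` layout (a dict mapping each benchmark name to a list of examples) are assumptions made for the example.

```python
# Benchmark names as listed above.
DATASETS = ["TriviaQA", "CommonsenseQA", "Belebele", "CoQA", "Math", "SVAMP"]


def cross_domain_split(train_name, data):
    """Return (train, test) for one cross-domain run.

    `data` is a hypothetical dict: benchmark name -> list of examples.
    The training set is a single benchmark; the test set pools the
    examples of the five remaining benchmarks.
    """
    if train_name not in DATASETS:
        raise KeyError(f"unknown benchmark: {train_name}")
    train = list(data[train_name])
    test = [ex for name in DATASETS if name != train_name for ex in data[name]]
    return train, test
```

Calling `cross_domain_split("CoQA", data)` would train on CoQA alone and evaluate on the pooled examples of the other five benchmarks.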
Metrics:
- AUROC
- Statistical methodology: Not explicitly reported in the paper
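For reference, AUROC can be computed directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below assumes that hallucinated responses are labeled 1 and factual ones 0, and that a higher detector score means "more likely hallucinated"; the source does not state the paper's labeling convention.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; tied scores count as half a win.

    Assumed convention: label 1 = hallucinated, label 0 = factual.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This quadratic version is only meant to make the definition explicit; for large test pools a rank-based implementation is the usual choice.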
Key Results
Main cross-domain generalization results showing SpikeScore's superior average AUROC across multiple LLMs compared to baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 6 datasets | AUROC | 0.7397 | 0.7550 | +0.0153 |
| Average across 6 datasets | AUROC | 0.7602 | 0.7780 | +0.0178 |
| Average across 6 datasets | AUROC | 0.7186 | 0.7712 | +0.0526 |

RAG scenario evaluation (TriviaQA and RAGTruth) demonstrating robustness when applied to retrieval pipelines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TriviaQA (RAG) | AUROC | 0.7412 | 0.7731 | +0.0319 |
| RAGTruth | AUROC | 0.7208 | 0.7490 | +0.0282 |
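The Δ column is the absolute AUROC difference (This Paper minus Baseline). A short check, using only the numbers reported above, reproduces each entry; the variable names are illustrative.

```python
# (benchmark, baseline AUROC, SpikeScore AUROC), copied from the reported results.
rows = [
    ("Average across 6 datasets", 0.7397, 0.7550),
    ("Average across 6 datasets", 0.7602, 0.7780),
    ("Average across 6 datasets", 0.7186, 0.7712),
    ("TriviaQA (RAG)", 0.7412, 0.7731),
    ("RAGTruth", 0.7208, 0.7490),
]

# Absolute improvement, rounded to the four decimals used in the table.
deltas = [round(ours - base, 4) for _, base, ours in rows]

for (name, base, ours), d in zip(rows, deltas):
    print(f"{name}: {base:.4f} -> {ours:.4f} (delta {d:+.4f})")
```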
Main Takeaways
- SpikeScore consistently outperforms baselines in cross-domain settings, indicating that 'instability' is a domain-invariant feature of hallucination.
- The method scales well: performance gains increase with larger model sizes (e.g., Qwen3-14B), likely due to stronger self-correction mechanisms in larger models.
- Robust to RAG noise: SpikeScore remains effective even when hallucinations stem from imperfect retrieval, unlike the baselines, which degrade significantly.
- Theoretical analysis confirms that curvature-based scoring provides a probabilistic lower bound for separability between hallucinated and factual responses.