Evaluation Setup
Hallucination detection on QA and text generation tasks.
Benchmarks:
- TriviaQA (Question Answering)
- SQuAD (Reading Comprehension/QA)
- Natural Questions (NQ) (Open-domain QA)
- TruthfulQA (Factuality evaluation)
Metrics:
- AUROC (Area Under Receiver Operating Characteristic Curve)
- Statistical methodology: Not explicitly reported in the paper
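As a reminder of what the AUROC numbers below measure, here is a minimal sketch of the pairwise (Mann-Whitney) formulation: the probability that a randomly chosen hallucinated answer receives a higher detector score than a randomly chosen correct one. The function name `auroc`, the label convention (1 = hallucinated), and the toy scores are assumptions for illustration; they are not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    scores: detector scores, higher = more likely hallucinated (assumed convention)
    labels: 1 for hallucinated, 0 for non-hallucinated (assumed convention)
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auroc(toy_scores, toy_labels))
```

The quadratic pairwise loop is chosen for readability; on real benchmark sizes a rank-based implementation such as `sklearn.metrics.roc_auc_score` would be the usual choice.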
Key Results
SHINE achieves state-of-the-art hallucination detection performance across multiple datasets and models compared to unsupervised baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TriviaQA | AUROC | 0.81 | 0.88 | +0.07 |
| SQuAD | AUROC | 0.78 | 0.82 | +0.04 |
| TruthfulQA | AUROC | 0.68 | 0.83 | +0.15 |
| TriviaQA | AUROC | 0.73 | 0.88 | +0.15 |
Main Takeaways
- Perturbing key entities in the input reveals distinct patterns for each hallucination type: fabricated text shows low KL divergence under perturbation (it is insensitive to the input), while aligned text changes significantly.
- Misaligned text (contradictions) often shows *increased* probability for generated tokens when noise is added (positive Delta P).
- The method generalizes well across different LLM architectures (LLaMA2, LLaMA3, Mistral, Qwen) without retraining.
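The two signals named in the takeaways above, KL divergence under perturbation and Delta P, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes we already have, for each generated token, the model's next-token distribution under the original input and under the entity-perturbed input. The helper names (`kl_divergence`, `perturbation_signals`), the direction of the KL term, the averaging over tokens, and the toy distributions are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def perturbation_signals(orig_dists, pert_dists, token_ids):
    """Return (mean KL, mean Delta P) over the generated tokens.

    orig_dists[i]: next-token distribution at step i with the original input
    pert_dists[i]: the same step with key entities perturbed
    token_ids[i]:  index of the token that was actually generated at step i
    Delta P is taken as P_perturbed - P_original, so a positive value means
    the generated token became MORE likely after noise was added.
    """
    kls, deltas = [], []
    for p, q, t in zip(orig_dists, pert_dists, token_ids):
        kls.append(kl_divergence(p, q))
        deltas.append(q[t] - p[t])
    n = len(token_ids)
    return sum(kls) / n, sum(deltas) / n


# Toy three-word vocabulary, two generated tokens. All numbers are invented.
tokens = [0, 1]
original = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

unchanged = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]        # fabricated-like: ignores the input
shifted = [[0.2, 0.5, 0.3], [0.4, 0.3, 0.3]]          # aligned-like: depends on the input
boosted = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]      # misaligned-like: positive Delta P

for name, perturbed in [("unchanged", unchanged), ("shifted", shifted), ("boosted", boosted)]:
    mean_kl, mean_dp = perturbation_signals(original, perturbed, tokens)
    print(f"{name}: mean KL = {mean_kl:.3f}, mean Delta P = {mean_dp:+.3f}")
```

How these signals are combined into a final detection score, and any thresholds involved, are not specified in this summary, so the sketch stops at computing the raw quantities.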