Evaluation Setup
Zero-shot hallucination detection on QA tasks
Benchmarks:
- Books (entity-centric question answering)
- Movies (entity-centric question answering)
- Global Country Information (GCI) (geographical/demographic question answering)
Metrics:
- AUC (Area Under Curve)
- Statistical methodology: Not explicitly reported in the paper
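To make the metric concrete, here is a minimal sketch of how an AUC value can be computed from detector scores. It assumes the curve in question is the ROC curve, which the summary above does not state, and uses the pairwise-ranking definition: the probability that a randomly chosen hallucinated answer receives a higher score than a randomly chosen correct one. The function name `roc_auc` and the toy labels and scores are illustrative, not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise ranking (assumption: the paper's AUC is ROC AUC).

    labels: 1 for a hallucinated answer, 0 for a correct one.
    scores: detector output, higher means "more likely hallucinated".
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration only).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)  # 7 of the 9 positive/negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the values reported below should be read.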
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AGSER consistently outperforms baselines across different LLMs on hallucination detection AUC. | | | | |
| Average across Books/Movies/GCI | AUC | 0.850 | 0.886 | +0.036 |
| Average across Books/Movies/GCI | AUC | 0.867 | 0.895 | +0.028 |
| Average across Books/Movies/GCI | AUC | 0.880 | 0.889 | +0.009 |
| Average across Books/Movies/GCI | AUC | 0.824 | 0.891 | +0.067 |
| Ablation studies confirm the necessity of both attentive and non-attentive query components. | | | | |
| Average across Books/Movies/GCI | AUC | 0.575 | 0.886 | +0.311 |
| Average across Books/Movies/GCI | AUC | 0.877 | 0.886 | +0.009 |
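The Δ column is the absolute AUC difference, This Paper minus Baseline. A short check of that arithmetic over the reported rows follows; the variable names are illustrative, and the rows are kept in table order because the table does not label which LLM or baseline each row corresponds to.

```python
# (baseline AUC, this-paper AUC, reported delta), in the order the rows appear.
rows = [
    (0.850, 0.886, 0.036),
    (0.867, 0.895, 0.028),
    (0.880, 0.889, 0.009),
    (0.824, 0.891, 0.067),
    (0.575, 0.886, 0.311),  # ablation row
    (0.877, 0.886, 0.009),  # ablation row
]


def delta(baseline, ours):
    """Absolute AUC improvement, rounded to the table's three decimals."""
    return round(ours - baseline, 3)


# True when every reported delta matches the recomputed one.
deltas_consistent = all(
    abs(delta(base, ours) - reported) < 1e-9 for base, ours, reported in rows
)
```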
Main Takeaways
- Attentive queries are the primary driver of detection performance, while non-attentive queries supply the reference point (background noise) against which the attentive signal is compared
- Mean pooling of attention across layers works better than using only the last layer or middle layer, suggesting hallucinations leave traces throughout the depth of the model
- The method is robust across different model families (Llama vs Qwen) and sizes (7B to 14B)
- The efficiency gain is substantial: the method needs 3 inference passes versus 5 or more for stochastic baselines, which makes it more practical for real-time applications
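The first two takeaways can be illustrated with a small sketch of layer-wise mean pooling followed by a split of the query into attentive and non-attentive tokens. This is a sketch under stated assumptions, not the paper's implementation: the per-token attention scores are made-up numbers, the top-k selection rule and the `keep_ratio` parameter are hypothetical, and the function names are chosen for illustration.

```python
def mean_pool_layers(attn_by_layer):
    """Average each token's attention score over all layers.

    attn_by_layer: list of layers, each a list with one score per query token.
    """
    n_layers = len(attn_by_layer)
    n_tokens = len(attn_by_layer[0])
    return [
        sum(layer[t] for layer in attn_by_layer) / n_layers
        for t in range(n_tokens)
    ]


def split_query(tokens, scores, keep_ratio=0.5):
    """Split tokens into attentive (top-scoring) and non-attentive (the rest).

    keep_ratio is a hypothetical knob: the fraction of tokens treated as
    attentive. Original token order is preserved within each part.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    attentive = [tok for i, tok in enumerate(tokens) if i in top]
    non_attentive = [tok for i, tok in enumerate(tokens) if i not in top]
    return attentive, non_attentive


# Toy example: three layers of made-up attention over a six-token query.
tokens = ["Who", "wrote", "the", "novel", "Dune", "?"]
attn_by_layer = [
    [0.05, 0.30, 0.05, 0.20, 0.35, 0.05],
    [0.10, 0.20, 0.05, 0.15, 0.45, 0.05],
    [0.05, 0.25, 0.10, 0.25, 0.30, 0.05],
]
pooled = mean_pool_layers(attn_by_layer)
attentive, non_attentive = split_query(tokens, pooled)
```

With a real model, the per-layer scores would come from the model's attention outputs rather than hand-written lists. Each of the two resulting queries would then account for one inference pass, in addition to the pass that produced the original answer, which is consistent with the three passes noted above.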