Evaluation Setup
Task: detecting hallucinations in the references that LLMs generate when fact-checking claims
Benchmarks:
- AutoHall-Generated Dataset (Hallucination Detection) [New]
Metrics:
- AUC-ROC (area under the receiver operating characteristic curve)
- Statistical methodology: Not explicitly reported in the paper
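
As a rough illustration of the AUC-ROC metric listed above, the sketch below computes it from binary labels and detector scores using the pairwise (Mann-Whitney) formulation: the fraction of (hallucinated, factual) pairs in which the hallucinated reference receives the higher score, with ties counted as half. The function name `auc_roc` and the toy labels and scores are hypothetical, invented for this example; the summary does not describe how the paper itself computed the metric.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: 1 = hallucinated reference, 0 = factual reference (assumed encoding).
    scores: detector outputs, where a higher score means "more likely hallucinated".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only (not results from the paper).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
print(f"AUC-ROC on toy data: {toy_auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a detector that scores every hallucinated reference above every factual one. The pairwise loop is quadratic in the number of examples, so for large datasets a rank-based or library implementation would be a more practical choice.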
Main Takeaways
- Hallucination rates are substantial (20-30%) for both open-source (Llama-2) and closed-source (ChatGPT) models
- The proposed self-contradiction detection method consistently outperforms baselines (SelfCheckGPT variants) on the constructed datasets
- Some topics, such as history, technology, and geography, trigger higher hallucination rates than others