Evaluation Setup
Evaluate LLMs and detection methods on the DiaHalu benchmark for hallucination identification
Benchmarks:
- DiaHalu (Dialogue-level Hallucination Detection) [New]
Metrics:
- Accuracy
- F1 score
- Precision
- Recall
- Statistical methodology: Fleiss' kappa is reported for inter-annotator agreement
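As a rough illustration of how the metrics listed above are typically computed, the following sketch implements accuracy, precision, recall, and F1 for a binary detector, plus Fleiss' kappa for agreement among a fixed number of annotators. It assumes binary labels (1 = hallucinated, 0 = not hallucinated); the function names and the toy data are hypothetical and are not taken from the DiaHalu paper or its code.

```python
from collections import Counter


def binary_detection_metrics(gold, pred, positive=1):
    """Accuracy, precision, recall and F1, treating `positive` as 'hallucinated'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`, a list with one list of labels per item.

    Every item must be labelled by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: fraction of rater pairs that agree on that item.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(c[cat] for c in counts) / (n_items * n_raters)
              for cat in categories]
    expected = sum(s * s for s in shares)
    if expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (observed - expected) / (1 - expected)


# Toy data for illustration only (not from the benchmark).
toy_gold = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
print(binary_detection_metrics(toy_gold, toy_pred))
print(fleiss_kappa([[1, 1, 1], [0, 0, 0], [1, 1, 0]]))
```

Treating "hallucinated" as the positive class is a design choice of this sketch: precision and recall then describe how well a detector flags hallucinated dialogues, which is usually the quantity of interest.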
Key Results
Analysis of hallucination rates across different dialogue domains in the constructed benchmark.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| DiaHalu | Hallucination Rate (Knowledge-grounded) | N/A | 32.8% | N/A |
| DiaHalu | Hallucination Rate (Reasoning) | N/A | 35.2% | N/A |
| DiaHalu | Hallucination Rate (Chit-Chat) | N/A | 12.4% | N/A |
| DiaHalu | Hallucination Rate (Task-oriented) | N/A | 19.6% | N/A |
Main Takeaways
- Hallucinations are domain-dependent: Reasoning (35.2%) and Knowledge-grounded (32.8%) dialogues show far higher hallucination rates than Chit-Chat (12.4%).
- Faithfulness issues (Incoherence, Irrelevance) are pervasive in Chit-Chat and Task-oriented dialogues, challenging the assumption that hallucination is purely a factuality problem.
- Existing detection methods (such as simple prompting or uncertainty metrics) struggle with the subtle, context-dependent hallucinations in DiaHalu, which indicates that it is a challenging benchmark.
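To make the domain comparison in the first takeaway concrete, the short sketch below ranks the four domains by the hallucination rates reported in the Key Results table and expresses each rate relative to Chit-Chat. The rates are copied from that table; the variable and function names are hypothetical, and the ratios are simple descriptive arithmetic, not a statistical test.

```python
# Hallucination rates (%) as reported in the Key Results table.
DOMAIN_RATES = {
    "Knowledge-grounded": 32.8,
    "Reasoning": 35.2,
    "Chit-Chat": 12.4,
    "Task-oriented": 19.6,
}


def rank_by_rate(rates):
    """Return (domain, rate) pairs sorted from highest to lowest rate."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


def ratio_to_reference(rates, reference):
    """Express every rate as a multiple of the reference domain's rate."""
    ref = rates[reference]
    return {domain: rate / ref for domain, rate in rates.items()}


for domain, rate in rank_by_rate(DOMAIN_RATES):
    print(f"{domain}: {rate:.1f}%")

for domain, ratio in ratio_to_reference(DOMAIN_RATES, "Chit-Chat").items():
    print(f"{domain}: {ratio:.2f}x the Chit-Chat rate")
```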