Evaluation Setup
Benchmarks various LLMs on the FaithQA dataset across four tasks (Fact QA, Creative Writing, Response Evaluation, Content Analysis).
Benchmarks:
- FaithQA (intent hallucination evaluation: omission and misinterpretation) [New]
Metrics:
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FaithQA (subset) | Correlation with Human Judgment | Not reported in the paper | Not reported in the paper | - |
| FaithQA | Number of queries | 0 | 20068 | +20068 |
Main Takeaways
- Intent hallucination is a common issue even for state-of-the-art models, not just smaller models
- The phenomenon stems primarily from omission (ignoring query parts) or misinterpretation (hallucinating requirements)
- LLM-as-a-judge baselines tend to be biased when evaluating intent, whereas the proposed decomposition-based Constraint Score aligns better with human labels
- Increasing query complexity (more constraints) correlates with a higher rate of intent hallucination
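
To make the decomposition idea in the takeaways concrete, here is a minimal sketch of a constraint-level score. It is an illustration only, not the paper's definition of Constraint Score: the `Constraint` class, the `constraint_score` function, the keyword-based checks, and the equal weighting of constraints are all assumptions introduced here. The sketch assumes a query has already been decomposed into individual constraints and scores a response by the fraction it satisfies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """One requirement extracted from a query (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]  # returns True if the response satisfies it


def constraint_score(response: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints the response satisfies (assumed equal weights)."""
    if not constraints:
        return 1.0  # nothing to violate
    satisfied = sum(1 for c in constraints if c.check(response))
    return satisfied / len(constraints)


# Toy decomposition of a query with three constraints (illustrative only).
query_constraints = [
    Constraint("mentions Paris", lambda r: "paris" in r.lower()),
    Constraint(
        "mentions a year",
        lambda r: any(tok.strip(".,").isdigit() for tok in r.split()),
    ),
    Constraint("under 20 words", lambda r: len(r.split()) < 20),
]

response = "Paris is the capital of France."
score = constraint_score(response, query_constraints)
print(f"{score:.2f}")
```

In this toy case the response omits the year, so one of the three constraints fails and the score drops accordingly. In practice each check would likely be a model or human judgment rather than a keyword test. Under this scheme, a query with more constraints offers more opportunities for an omission, which matches the direction of the complexity trend noted above.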