
Beyond Facts: Evaluating Intent Hallucination in Large Language Models

Yijie Hao, Haofei Yu, Jiaxuan You
Emory University, University of Illinois Urbana-Champaign
arXiv (2025)
Factuality Benchmark RAG QA

📝 Paper Summary

Hallucination suppression, Metrics and evaluation
The authors introduce 'Intent Hallucination' to describe when LLMs fail to address query constraints, proposing the FaithQA benchmark and Constraint Score metric to evaluate this non-factual hallucination type.
Core Problem
Current hallucination research focuses on factual errors and overlooks 'Intent Hallucination', where LLMs omit or misinterpret constraints in complex queries even when the output is factually correct.
Why it matters:
  • As users provide increasingly complex multi-condition queries to advanced LLMs, partial satisfaction of intents becomes a major failure mode
  • Existing metrics (factual precision, recall) cannot detect when a model ignores a specific constraint (e.g., 'write a poem') while remaining factually accurate
  • There is no existing benchmark tailored to identify the fundamental causes of intent hallucination (omission and misinterpretation)
Concrete Example: Query: 'Write a poem about Elon Musk born in South Africa.' Model response: 'Elon Musk was born in South Africa...' (factually correct, but fails the 'poem' constraint). Existing factual metrics would score this response highly and miss the intent failure.
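The gap in this example can be illustrated with a minimal sketch. Both checks below are toy heuristics invented for illustration (the function names and the "at least four short lines" rule are assumptions, not the paper's method): a fact-only check accepts the prose reply, while a separate check on the requested form rejects it.

```python
def states_birthplace(response: str) -> bool:
    """Toy factual check (assumption): the response mentions the birthplace."""
    return "south africa" in response.lower()


def looks_like_poem(response: str) -> bool:
    """Toy form check (assumption): at least four short, non-empty lines."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return len(lines) >= 4 and all(len(ln.split()) <= 12 for ln in lines)


# A prose reply like the one in the example above.
prose = "Elon Musk was born in South Africa."

# A hypothetical reply that also honours the 'poem' constraint.
poem = "\n".join([
    "Born beneath a South African sky,",
    "a restless boy looked up,",
    "counting rockets in the dark,",
    "long before the launch.",
])

for name, response in [("prose", prose), ("poem", poem)]:
    # Prints the factual check first, then the form check.
    print(name, states_birthplace(response), looks_like_poem(response))
```

Both replies pass the factual check, so a fact-only metric cannot tell them apart; only the form check separates them.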
Key Novelty
Intent Hallucination Framework & Constraint Score
  • Decomposes complex queries into 'Intent Constraints' (mandatory, important, optional) derived from semantic roles (subject, action, context)
  • Defines Intent Hallucination specifically as the omission or misinterpretation of these constraints, distinct from factual fabrication
  • Introduces FaithQA, a dataset of 20,068 problems designed to elicit omission and misinterpretation in both query-only and RAG settings
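One way to picture the decomposition is as a list of constraints, each tagged with a semantic role and a priority, plus an aggregate over which constraints a response satisfied. The sketch below is illustrative only: the `IntentConstraint` class, the `WEIGHTS` values, the priority assigned to each constraint, and the weighted-fraction formula are all assumptions, since this summary does not state how the Constraint Score is actually computed.

```python
from dataclasses import dataclass

# Hypothetical weight per priority level; this summary gives no values.
WEIGHTS = {"mandatory": 3.0, "important": 2.0, "optional": 1.0}


@dataclass(frozen=True)
class IntentConstraint:
    role: str      # semantic role: "subject", "action", or "context"
    text: str      # the constraint as it appears in the query
    priority: str  # "mandatory", "important", or "optional"


def weighted_satisfaction(constraints, satisfied):
    """Weighted fraction of constraints marked satisfied, in [0, 1]."""
    total = sum(WEIGHTS[c.priority] for c in constraints)
    if total == 0:
        return 0.0
    met = sum(WEIGHTS[c.priority] for c, ok in zip(constraints, satisfied) if ok)
    return met / total


# Hand-written decomposition of the poem query; priorities are assumed.
query_constraints = [
    IntentConstraint("action", "write a poem", "mandatory"),
    IntentConstraint("subject", "Elon Musk", "mandatory"),
    IntentConstraint("context", "born in South Africa", "important"),
]

# Prose reply: right subject and context, but not a poem.
# In practice these judgments would come from a human or an LLM judge.
score = weighted_satisfaction(query_constraints, [False, True, True])
print(score)  # 0.625
```

Under these assumed weights, omitting a mandatory constraint costs more than omitting an optional one, which matches the intuition behind the three priority levels.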
Evaluation Highlights
  • The Constraint Score metric aligns more closely with human judgment of intent hallucination than standard LLM-as-a-judge baselines
  • Intent hallucination is prevalent even in state-of-the-art models, with error rates increasing as query complexity (number of constraints) rises
  • In RAG settings, LLMs frequently fail to detect missing information (misinterpretation), often hallucinating answers instead of refusing to answer
Breakthrough Assessment
8/10
Significant conceptual contribution by formalizing non-factual hallucination. The benchmark is large-scale and the metric addresses a blind spot in current evaluation, though the reliance on LLMs for scoring introduces some circularity.