Beyond Facts: Evaluating Intent Hallucination in Large Language Models

📝 Paper Summary

Hallucination suppression, Metrics and evaluation

The authors introduce 'Intent Hallucination' to describe cases where LLMs fail to address the constraints in a query, and they propose the FaithQA benchmark and the Constraint Score metric to evaluate this non-factual type of hallucination.

Core Problem

Current hallucination research focuses on factual errors and overlooks 'Intent Hallucination', in which LLMs omit or misinterpret constraints in complex queries even when the output is factually correct.

Why it matters:

As users provide increasingly complex multi-condition queries to advanced LLMs, partial satisfaction of intents becomes a major failure mode
Existing metrics (factual precision, recall) cannot detect when a model ignores a specific constraint (e.g., 'write a poem') while remaining factually accurate
There is no existing benchmark tailored to identify the fundamental causes of intent hallucination (omission and misinterpretation)

Concrete Example: Query: 'Write a poem about Elon Musk born in South Africa.' Model response: 'Elon Musk was born in South Africa...' (factually correct, but fails the 'poem' constraint). Existing factual metrics would score this response highly and miss the intent failure.

Key Novelty

Intent Hallucination Framework & Constraint Score

Decomposes complex queries into 'Intent Constraints' (mandatory, important, optional) derived from semantic roles (subject, action, context)
Defines Intent Hallucination specifically as the omission or misinterpretation of these constraints, distinct from factual fabrication
Introduces FaithQA, a dataset of 20,068 problems designed to elicit omission and misinterpretation in both query-only and RAG settings
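The decomposition described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names, the field names, and the priority assigned to each constraint of the example query are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Semantic roles named in the summary."""
    SUBJECT = "subject"
    ACTION = "action"
    CONTEXT = "context"


class Priority(Enum):
    """Constraint tiers named in the summary."""
    MANDATORY = "mandatory"
    IMPORTANT = "important"
    OPTIONAL = "optional"


@dataclass(frozen=True)
class IntentConstraint:
    text: str          # natural-language statement of the constraint
    role: Role         # semantic role the constraint was derived from
    priority: Priority


# Hand-written decomposition of the example query.
# The priorities below are assumed, not taken from the paper.
query = "Write a poem about Elon Musk born in South Africa."
constraints = [
    IntentConstraint("the response is a poem", Role.ACTION, Priority.MANDATORY),
    IntentConstraint("the response is about Elon Musk", Role.SUBJECT, Priority.MANDATORY),
    IntentConstraint("the response mentions his birth in South Africa",
                     Role.CONTEXT, Priority.IMPORTANT),
]


def by_priority(items, priority):
    """Return the constraints that belong to one tier."""
    return [c for c in items if c.priority is priority]


mandatory = by_priority(constraints, Priority.MANDATORY)
```

Under this representation, omission corresponds to a listed constraint the response does not satisfy, and misinterpretation corresponds to the response addressing a constraint that is not in the list.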

Evaluation Highlights

The Constraint Score metric aligns more closely with human judgment of intent hallucination than standard LLM-as-a-judge baselines
Intent hallucination is prevalent even in state-of-the-art models, with error rates increasing as query complexity (number of constraints) rises
In RAG settings, LLMs frequently fail to detect missing information (misinterpretation), often hallucinating answers instead of refusing to answer

Breakthrough Assessment

8/10

Significant conceptual contribution by formalizing non-factual hallucination. The benchmark is large-scale and the metric addresses a blind spot in current evaluation, though the reliance on LLMs for scoring introduces some circularity.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of text generation consistency with user query constraints

Inputs: User query q containing multiple conditions/constraints

Outputs: Generated response y and a scalar Constraint Score

Pipeline Flow

Preliminary Assessment (check for sufficient info)
Semantic Role Identification (extract subject/action/context)
Constraint Set Extraction (categorize into mandatory/important/optional)
Constraint Scoring (evaluate response against constraints)
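The four stages above can be sketched as a function that chains four pluggable steps. This is an illustrative skeleton under stated assumptions: in the paper each step is LLM-based, whereas here every step is a hypothetical callable, and the toy stand-ins use hard-coded keyword rules so that the example runs without a model.

```python
from typing import Callable, Optional


def evaluate_response(
    query: str,
    response: str,
    has_sufficient_info: Callable[[str], bool],
    identify_roles: Callable[[str], dict[str, str]],
    extract_constraints: Callable[[str, dict[str, str]], list[str]],
    is_satisfied: Callable[[str, str], bool],
) -> Optional[dict[str, bool]]:
    """Run the four stages; return per-constraint verdicts, or None if stage 1 fails."""
    # 1. Preliminary assessment: stop early if the query cannot be evaluated.
    if not has_sufficient_info(query):
        return None
    # 2. Semantic role identification.
    roles = identify_roles(query)
    # 3. Constraint set extraction.
    constraints = extract_constraints(query, roles)
    # 4. Constraint scoring: one boolean verdict per constraint.
    return {c: is_satisfied(response, c) for c in constraints}


# Toy stand-ins (assumptions) for what would be LLM calls.
def toy_identify_roles(query: str) -> dict[str, str]:
    return {"action": "is a poem", "subject": "Elon Musk", "context": "South Africa"}


def toy_extract_constraints(query: str, roles: dict[str, str]) -> list[str]:
    return list(roles.values())


def toy_is_satisfied(response: str, constraint: str) -> bool:
    if constraint == "is a poem":
        return "\n" in response.strip()  # crude proxy: a poem has several lines
    return constraint.lower() in response.lower()


verdicts = evaluate_response(
    "Write a poem about Elon Musk born in South Africa.",
    "Elon Musk was born in South Africa.",
    has_sufficient_info=lambda q: bool(q.strip()),
    identify_roles=toy_identify_roles,
    extract_constraints=toy_extract_constraints,
    is_satisfied=toy_is_satisfied,
)
```

Passing the steps in as callables keeps the control flow separate from whichever model performs extraction and verification.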

System Modules

Constraint Extractor (Metric Pipeline)

Decompose query into atomic intent constraints

Model or implementation: LLM-based extraction (the specific model is not named in the extracted text)

Constraint Scorer (Metric Pipeline)

Verify if a generated response satisfies each extracted constraint

Model or implementation: LLM-based verifier

Novel Architectural Elements

Hierarchical constraint weighting system (Mandatory vs. Important vs. Optional) for hallucination scoring
Query-centric evaluation pipeline that focuses on constraint recall rather than factual precision
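One way to picture the hierarchical weighting is as a weighted fraction of satisfied constraints. The sketch below is an assumption-laden illustration rather than the paper's formula: the weight values (3, 2, 1), the normalization to the range [0, 1], and the names `WEIGHTS`, `Verdict`, and `constraint_score` are all hypothetical, since the summary names the three tiers but not how they are combined.

```python
from dataclasses import dataclass

# Hypothetical weights: the summary lists the tiers but not their values.
WEIGHTS = {"mandatory": 3.0, "important": 2.0, "optional": 1.0}


@dataclass(frozen=True)
class Verdict:
    constraint: str
    priority: str    # one of the keys of WEIGHTS
    satisfied: bool


def constraint_score(verdicts: list[Verdict]) -> float:
    """Weighted share of satisfied constraints, normalized to [0, 1]."""
    if not verdicts:
        raise ValueError("no constraints to score")
    total = sum(WEIGHTS[v.priority] for v in verdicts)
    earned = sum(WEIGHTS[v.priority] for v in verdicts if v.satisfied)
    return earned / total


# The prose response to the poem query: factually fine, but not a poem.
verdicts = [
    Verdict("is a poem", "mandatory", False),
    Verdict("is about Elon Musk", "mandatory", True),
    Verdict("mentions birth in South Africa", "important", True),
]
score = constraint_score(verdicts)  # (3 + 2) / (3 + 3 + 2) = 0.625
```

With these assumed weights, missing a mandatory constraint costs more than missing an optional one, which is the behavior the tiered scheme is meant to capture.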

Modeling

Base Model: Various LLMs are evaluated. The model used to implement the metric is not explicitly named in the extracted text; it is likely a GPT-4-class model.

Comparison to Prior Work

vs. HaluEval/FELM: FaithQA focuses on query alignment (intent) rather than factual accuracy against world knowledge
vs. InfoBench: FaithQA specifically targets 'hallucination' via omission/misinterpretation rather than general instruction following
vs. FaithEval: FaithQA shifts focus from context alignment (is the answer in the doc?) to query alignment (did the model address the user's specific constraints?)

Limitations

Relies on an LLM to extract and verify constraints, which may introduce its own biases or errors
Constraint extraction might struggle with highly ambiguous or implicit user intents
Evaluation is currently limited to English language queries

Reproducibility

Benchmark FaithQA contains 20,068 problems. The prompt template for constraint mapping is provided in Appendix D.2. Code availability is not explicitly stated in the main text.

📊 Experiments & Results

Evaluation Setup

Benchmarking various LLMs on the FaithQA dataset across four tasks (Fact QA, Creative Writing, Response Evaluation, Content Analysis)

Benchmarks:

FaithQA: intent hallucination evaluation (omission and misinterpretation) [New]

Metrics:

Constraint Score (CS)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FaithQA (subset)	Correlation with Human Judgment	Not reported in the paper	Not reported in the paper	-
FaithQA	Number of queries	0	20,068	+20,068

Main Takeaways

Intent hallucination is a common issue even for state-of-the-art models, not just smaller models
The phenomenon stems primarily from omission (ignoring query parts) or misinterpretation (hallucinating requirements)
LLM-as-a-judge baselines tend to be biased when evaluating intent, whereas the proposed decomposition-based Constraint Score aligns better with human labels
Increasing query complexity (more constraints) correlates with a higher rate of intent hallucination
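The relationship between query complexity and intent hallucination could be checked by grouping evaluated responses by their number of constraints. The following is a minimal sketch; the function name is hypothetical, and the records are made-up placeholders that only show the input format, not results from the paper.

```python
from collections import defaultdict


def hallucination_rate_by_complexity(records):
    """records: iterable of (n_constraints, all_constraints_satisfied) pairs.

    Returns {n_constraints: share of responses with at least one unmet constraint}.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for n_constraints, all_satisfied in records:
        totals[n_constraints] += 1
        if not all_satisfied:
            failures[n_constraints] += 1
    return {n: failures[n] / totals[n] for n in sorted(totals)}


# Placeholder records (NOT data from the paper), used only to exercise the function.
records = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
rates = hallucination_rate_by_complexity(records)
```

A rate that rises with the constraint count would be consistent with the takeaway above; real analysis would use per-response verdicts from the benchmark.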

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM Hallucination (factual vs. non-factual)
Retrieval-Augmented Generation (RAG) workflows
Semantic Role Labeling (SRL) concepts

Key Terms

Intent Hallucination: When an LLM generates a response that deviates from the user's query intent by omitting required constraints or misinterpreting them (inventing new constraints)

Constraint Score: An automatic evaluation metric that decomposes a query into atomic constraints and calculates a weighted score based on how many are satisfied

FaithQA: A new benchmark dataset of 20,068 queries designed to test LLMs on omission (Fact QA, Creative Writing) and misinterpretation (RAG scenarios)

SRL: Semantic Role Labeling, a process that extracts the fundamental components of a sentence (subject, action, context); it is used here to identify constraints

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then generates its answer conditioned on them