← Back to Paper List

(Im)possibility of Automated Hallucination Detection in Large Language Models

Amin Karbasi, Omar Montasser, John Sous, Grigoris Velegkas
Yale University, Toyota Technological Institute at Chicago
arXiv (2025)
Factuality RL

📝 Paper Summary

Theoretical limitations of LLMs Automated hallucination detection
Theoretical analysis proves automated hallucination detection is impossible when trained only on correct examples, but becomes achievable for all countable languages if trained with explicitly labeled negative examples.
Core Problem
It is unknown whether automated hallucination detection is inherently feasible or fundamentally impossible, particularly given that LLMs often struggle to detect their own hallucinations without external feedback.
Why it matters:
  • Hallucinations severely limit LLM trustworthiness in sensitive applications, raising ethical and safety concerns
  • Empirical attempts to use LLMs as detectors often fail without human labels, but a theoretical explanation for this difficulty has been missing
  • Understanding fundamental limits guides whether to focus on better model architecture or better data labeling processes (e.g., RLHF)
Concrete Example: Consider a language of even numbers. A detector trained only on positive examples (seeing 2, 4, 6...) cannot distinguish if the LLM's full output range is 'all even numbers' (correct) or 'all integers' (hallucinating odds), because the positive examples are consistent with both hypotheses.
Key Novelty
Equivalence of Hallucination Detection to Language Identification
  • Formalizes hallucination detection as a game where a detector observes correct examples from a target language K and interacts with an LLM generating a set G
  • Proves mathematically that detecting if G is a subset of K (no hallucination) is equivalent to the classical problem of Language Identification in the Limit
  • Demonstrates that while positive-only data leads to impossibility results, adding negative examples (labeled incorrect statements) makes detection possible for all countable languages
Evaluation Highlights
  • Proves detection with only positive examples is impossible for super-finite collections (e.g., all finite sets and at least one infinite set)
  • Proves detection with explicitly labeled negative examples is possible for ALL countable collections of languages
  • Establishes a theoretical equivalence: a collection admits a hallucination detector if and only if it is identifiable in the limit (Angluin's condition)
Breakthrough Assessment
9/10
Provides the first rigorous theoretical foundation explaining why current LLM self-correction fails and why RLHF/negative feedback is mathematically necessary, moving the field from empirical observation to theoretical certainty.
×