
Black-Box Hallucination Detection via Consistency Under the Uncertain Expression

Seongho Joo, Kyungmin Min, Jahyun Koo, Kyomin Jung
Not explicitly listed in text
arXiv (2025)
Factuality QA Benchmark

📝 Paper Summary

Hallucination Detection · Uncertainty Estimation · Black-Box Methods
The paper proposes a hallucination detection metric that measures the consistency between a standard LLM response and a response generated with a prompt explicitly expressing uncertainty, finding that factual answers remain consistent while hallucinations shift.
Core Problem
LLMs frequently generate non-factual 'hallucinated' responses, but existing detection methods often require either access to internal states that are typically restricted (such as token probabilities) or expensive external resources (such as Wikipedia retrieval).
Why it matters:
  • Real-world LLM deployments often expose only black-box APIs, making white-box detection methods using internal probabilities unusable.
  • External knowledge bases may lack coverage for long-tail queries or specific domains.
  • Existing black-box methods (e.g., SelfCheckGPT) are computationally expensive, requiring ~10 sampled responses to estimate consistency.
Concrete Example: When asked 'How many acts is the ballet Rita Sangalli premiered in...?', a model might confidently hallucinate an answer. The proposed method detects this by re-asking with 'I am not sure but it could be', causing the model to change its answer if the original was a hallucination.
Key Novelty
Consistency Under Uncertain Expression (Black-Box Metric)
  • Leverages the observation that, when an LLM is prompted to express uncertainty, its answer stays consistent with the original if that answer was factual but shifts if it was hallucinated.
  • Uses a single additional inference step with a prompt like 'Answer: I am not sure but it could be' to trigger this divergence for non-factual answers.
  • Avoids the high cost of multiple sampling (e.g., 10+ generations) required by previous black-box consistency checks.
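The two-call check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask` is a hypothetical callable wrapping whatever black-box LLM API is in use, the prompt template around the quoted phrase is assumed, and the token-overlap score and the 0.5 threshold are stand-ins, since this summary does not say how the paper measures consistency.

```python
import re
from typing import Callable, Set, Tuple

# Phrase quoted in the summary; the surrounding prompt template is an assumption.
UNCERTAIN_PREFIX = "I am not sure but it could be"


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def consistency(a: str, b: str) -> float:
    """Jaccard overlap of token sets (an assumed stand-in for the paper's metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def detect_hallucination(
    question: str,
    ask: Callable[[str], str],  # hypothetical wrapper around a black-box LLM API
    threshold: float = 0.5,     # illustrative value, not taken from the paper
) -> Tuple[bool, float]:
    """Return (flagged, score) using one original and one perturbed call."""
    original = ask(f"Question: {question}\nAnswer:")
    perturbed = ask(f"Question: {question}\nAnswer: {UNCERTAIN_PREFIX}")
    score = consistency(original, perturbed)
    return score < threshold, score
```

In practice the lexical overlap would likely be replaced by a semantic comparison (for example an entailment model or an LLM judge), because two answers can agree in meaning while sharing few tokens.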
Architecture
Figure 1 (implied): conceptual workflow of the proposed detection method (text-based reconstruction)
Evaluation Highlights
  • Factual responses show high consistency (~87-90%) regardless of the prompt expression used.
  • Non-factual responses show significantly lower consistency (~32-48%) when prompted with uncertainty or certainty expressions.
  • The method effectively discriminates factuality using just two responses (original + perturbed), whereas baselines often need 10+ samples.
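Consistency percentages like those above can be read as the fraction of (original, perturbed) answer pairs that agree within each group. The sketch below shows one way such a rate might be computed. It assumes case-insensitive exact match as the agreement test, which is a simplification, and the answer pairs are toy values invented for illustration, not data from the paper.

```python
from typing import Iterable, Tuple


def exact_match(a: str, b: str) -> bool:
    """Assumed agreement test: equality after trimming and lowercasing."""
    return a.strip().lower() == b.strip().lower()


def consistency_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (original, perturbed) answer pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one answer pair")
    return sum(exact_match(a, b) for a, b in pairs) / len(pairs)


# Toy pairs, invented for illustration only.
factual = [("Paris", "paris"), ("1969", "1969"), ("Mars", "Mars "), ("H2O", "water")]
non_factual = [("three", "two"), ("1921", "1921"), ("Lyon", "Nice"), ("blue", "red")]

print(consistency_rate(factual), consistency_rate(non_factual))
```

A gap between the two rates is what makes a single perturbed response usable as a detection signal. The ("H2O", "water") pair counts as inconsistent here, which shows how exact match undercounts agreement between paraphrases.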
Breakthrough Assessment
7/10
Simple, effective, and efficient. It significantly reduces the compute cost of black-box hallucination detection (2 calls vs 10+) while offering a novel prompt-based mechanism.