
Cost-Effective Hallucination Detection for LLMs

Simon Valentin, Jinmiao Fu, Gianluca Detommaso, Shaoyuan Xu, Giovanni Zappella, Bryan Wang
Amazon Web Services
arXiv (2024)
Factuality, Benchmark, QA

📝 Paper Summary

Hallucination suppression, Confidence calibration
A framework for detecting hallucinations by calibrating and aggregating multiple confidence scores (multi-scoring), optimizing for detection performance under fixed computational budgets.
Core Problem
Existing hallucination detection methods lack comparative evaluation, often have prohibitive computational costs, and produce uncalibrated scores unsuitable for risk-aware production thresholds.
Why it matters:
  • Unreliable LLM outputs pose risks in critical applications (e.g., medical advice), requiring accurate risk quantification
  • Production settings have strict latency and cost constraints, making expensive detection methods (like sampling many responses) impractical
  • No single scoring method performs best across all datasets and models, creating a need for robust aggregation
Concrete Example: A user asks an LLM for medical advice. A single scoring method (e.g., SelfCheckGPT) may assign high confidence to a wrong answer, either because its raw scores are poorly calibrated or because the cost budget does not allow enough sampled responses. The proposed multi-scoring approach combines it with cheaper signals (such as P(True)) to flag the hallucination more reliably within budget.
Key Novelty
Cost-Effective Multi-Scoring for Hallucination Detection
  • Aggregates diverse hallucination scores (e.g., perplexity, self-contradiction, verbalized confidence) using logistic regression to leverage complementary signals
  • Applies state-of-the-art calibration (multicalibration) to raw scores to ensure probabilities reflect true hallucination rates
  • Solves a constrained optimization problem to select the best subset of scores that maximizes detection performance for a specific computational budget
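The aggregation and budget-constrained selection steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the score names, the toy values, the per-score costs, and the helpers `fit_logistic`, `auc_roc`, and `best_subset` are all hypothetical, the subset search is plain brute force, and the calibration step is omitted. For brevity it evaluates AUC-ROC on the same data used for fitting; a real setup would presumably use a held-out split.

```python
import itertools
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auc_roc(probs, labels):
    """Chance that a random positive is ranked above a random negative (ties count half)."""
    pos = [p for p, l in zip(probs, labels) if l == 1]
    neg = [p for p, l in zip(probs, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def best_subset(scores, labels, costs, budget):
    """Brute-force search for the affordable score subset with the highest AUC-ROC."""
    best_names, best_auc = None, -1.0
    names = sorted(scores)
    for k in range(1, len(names) + 1):
        for subset in itertools.combinations(names, k):
            if sum(costs[name] for name in subset) > budget:
                continue  # subset exceeds the computational budget
            X = [[scores[name][i] for name in subset] for i in range(len(labels))]
            w, b = fit_logistic(X, labels)
            probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) for row in X]
            auc = auc_roc(probs, labels)
            if auc > best_auc:
                best_names, best_auc = subset, auc
    return best_names, best_auc

# Toy data (made up for illustration): label 1 = hallucination, 0 = faithful.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = {
    "perplexity": [0.9, 0.8, 0.3, 0.7, 0.2, 0.4, 0.1, 0.6],
    "p_true":     [0.6, 0.7, 0.9, 0.4, 0.3, 0.2, 0.5, 0.1],
    "self_check": [0.9, 0.8, 0.9, 0.7, 0.1, 0.2, 0.3, 0.2],
}
# Hypothetical costs in units of extra LLM calls per response.
costs = {"perplexity": 1, "p_true": 1, "self_check": 10}

subset, auc = best_subset(scores, labels, costs, budget=2)
print(subset, round(auc, 3))
```

With a budget of 2, the sampling-based `self_check` score is never affordable, so the search can only combine the two cheap signals. Brute force is workable here because the number of candidate scores is small; a larger pool would need a smarter search.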
Evaluation Highlights
  • Multi-scoring outperforms the best individual score by +4% AUC-ROC on average across summarization, QA, and fact-checking datasets
  • Cost-effective multi-scoring matches the performance of expensive methods (like SelfCheckGPT) while using significantly fewer LLM calls
  • Calibration significantly improves risk assessment, reducing Expected Calibration Error (ECE) compared to raw scores
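For readers unfamiliar with ECE, the sketch below shows the commonly used equal-width binned estimate, applied to predicted hallucination probabilities. It is an assumption that this matches the estimator used in the paper; the function name, bin count, and example values are illustrative only.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean gap between predicted and observed hallucination rates."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(observed_rate - mean_prob)
    return ece

# Perfectly calibrated toy case: predictions match outcomes exactly.
ece_calibrated = expected_calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
# Overconfident toy case: predicts 0.9 but only half are hallucinations.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(ece_calibrated, ece_overconfident)
```

A lower value means the predicted probabilities track the observed hallucination rate more closely, which is what makes a fixed decision threshold meaningful as a risk level.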
Breakthrough Assessment
7/10
Provides a practical, production-oriented framework for combining existing methods. While it doesn't invent new fundamental scoring metrics, the cost-effective aggregation strategy is highly valuable for real-world deployment.