
Chainpoll: A high efficacy method for LLM hallucination detection

Robert Friel, Atindriyo Sanyal
Galileo Technologies Inc.
arXiv (2023)
Factuality Benchmark RAG QA

📝 Paper Summary

Hallucination suppression Metrics and evaluation
ChainPoll detects hallucinations by aggregating boolean judgments from a chain-of-thought-enabled LLM across multiple samples, outperforming existing metrics on a new, harder benchmark suite called RealHall.
Core Problem
Existing hallucination detection benchmarks rely on easy tasks or weak models that don't reflect modern LLM capabilities, and existing detection metrics are often inaccurate, expensive, or limited to specific domains.
Why it matters:
  • Hallucinations remain a primary blocker for enterprise adoption of LLMs due to trust and safety concerns
  • Current academic benchmarks use outdated models (e.g., GPT-2), making them irrelevant for evaluating SOTA models like GPT-4
  • Users need efficient, explainable metrics to monitor production systems without the high cost of human labeling or heavy GPU usage
Concrete Example: In the 'RealHall Closed' benchmark (using COVID-QA), a model might claim a study describes 'severe hospitalized cases' based on documents that only mention 'preventive measures.' Existing metrics often miss this subtle inconsistency, whereas ChainPoll catches it by reasoning through the documents.
Key Novelty
ChainPoll (Chain-of-Thought Polling)
  • Combines Chain-of-Thought (CoT) prompting with 'polling' (aggregating results from multiple inference runs) to improve judgment reliability
  • Uses a detailed prompt that forces the judge LLM to explain its reasoning before outputting a boolean decision, rather than predicting a scalar score
  • Introduces RealHall, a curated benchmark suite designed specifically to challenge modern SOTA LLMs, unlike previous benchmarks based on weaker models
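The polling idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `sample_fn` callable (a stand-in for any LLM client that returns one sampled completion per call), the default of 5 samples, and the choice to score by the fraction of "yes" verdicts are all hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt: asks for reasoning first, then a yes/no verdict on the last line.
PROMPT = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Does the answer contain hallucinations? Think step by step, "
    "then finish with a line of the form 'Verdict: yes' or 'Verdict: no'."
)

# Matches a trailing yes/no, optionally followed by punctuation or whitespace.
_VERDICT = re.compile(r"\b(yes|no)\b\W*$", re.IGNORECASE)


def parse_verdict(completion: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', None if no verdict is found."""
    match = _VERDICT.search(completion.strip())
    if match is None:
        return None
    return match.group(1).lower() == "yes"


def chainpoll_score(
    question: str,
    answer: str,
    sample_fn: Callable[[str], str],  # assumed: one sampled completion per call
    n_samples: int = 5,               # assumed default, not taken from the paper
) -> float:
    """Poll the judge n_samples times and return the fraction of 'yes' verdicts."""
    prompt = PROMPT.format(question=question, answer=answer)
    verdicts = [parse_verdict(sample_fn(prompt)) for _ in range(n_samples)]
    valid = [v for v in verdicts if v is not None]  # drop unparseable completions
    if not valid:
        raise ValueError("no completion contained a parseable verdict")
    return sum(valid) / len(valid)
```

Because each call draws a fresh completion, `sample_fn` would need a non-zero sampling temperature for the poll to produce varied judgments. The returned score lies in [0, 1], so it can be thresholded or ranked like any other detection metric.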
Evaluation Highlights
  • ChainPoll achieves 0.781 aggregate AUROC across RealHall, outperforming the next best method (SelfCheck-BertScore) by ~11%
  • ChainPoll uses only ~1/4 the inference compute of SelfCheck-BertScore while delivering higher accuracy
  • ChainPoll-Adherence reaches 0.789 AUROC on closed-domain tasks, beating TRUE (0.593) and G-Eval (0.584) by roughly 0.20 AUROC each
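For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen hallucinated response receives a higher detector score than a randomly chosen clean one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name and the toy inputs in the usage note are illustrative assumptions and are not data from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: labels are 1 for hallucinated, 0 for clean responses."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count a win when the positive outranks the negative, half a win on ties.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

With made-up scores `[0.9, 0.8, 0.4, 0.2]` and labels `[1, 0, 1, 0]`, three of the four positive-negative pairs are ordered correctly, so `auroc` returns 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of examples, which is fine for illustration but slow on large benchmarks.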
Breakthrough Assessment
7/10
Strong practical contribution with a new SOTA metric and a much-needed modernization of hallucination benchmarks. However, the core technique is a refinement of existing CoT and ensembling ideas rather than a fundamental architectural shift.