SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency

📝 Paper Summary

Hallucination Detection, Uncertainty Estimation

SAC3 improves hallucination detection in black-box LLMs by extending self-consistency checks with semantically equivalent question perturbations and cross-model verification to identify consistently wrong answers.

Core Problem

Existing self-consistency methods fail when LLMs are consistently wrong (question-level hallucination) or when a specific model architecture is prone to specific errors (model-level hallucination).

Why it matters:

Self-consistency alone is insufficient because high confidence/consistency does not guarantee factuality (e.g., an LLM might consistently answer that pi is not smaller than 3.2)
Token-level log probabilities required for some uncertainty metrics are unavailable in commercial black-box APIs like ChatGPT
Reliable detection is critical for deploying LLMs in high-stakes domains where factual accuracy is paramount

Concrete Example: When asked 'Is pi smaller than 3.2?', ChatGPT consistently answers 'No' (incorrectly). A standard self-consistency check sees consistent answers and falsely labels it factual. SAC3 perturbs the question ('Is 3.2 greater than pi?') and checks across models to expose the error.

Key Novelty

Semantic-Aware Cross-check Consistency (SAC3)

Perturbs the input question into semantically equivalent variants to check if the model's answers remain consistent across phrasing changes (tackling question-level fixation)
Introduces a second 'verifier' LLM to cross-check answers, identifying cases where the target model might be confidently wrong due to its specific training biases

Architecture

The 3-stage pipeline of SAC3: Question Perturbation, Cross-Checking (Model & Question), and Score Calculation.

Evaluation Highlights

Achieves 99.4% AUROC on the Prime Number classification task, outperforming the self-consistency baseline (65.9%) by 33.5 points
Improves hallucination detection on open-domain generation (HotpotQA-halu) to 88.0% AUROC compared to 74.2% for the baseline
Generalizes across target models, with SAC3-Q outperforming self-consistency on GPT-3.5, GPT-4, and PaLM 2

Breakthrough Assessment

8/10

Clearly exposes a key failure mode of standard self-consistency (consistent hallucinations) and provides a practical, black-box-compatible solution with large empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of model responses as Factual or Hallucinated based on consistency scores derived from sampling.

Inputs: Original question x0 and original response s0 from Target Model T

Outputs: A consistency score Z_SAC3 and a binary prediction (Factual/Non-factual)

Pipeline Flow

Stage 1: Question Perturbation (generate k semantically equivalent questions)
Stage 2: Response Sampling (Target & Verifier models generate answers for original & perturbed questions)
Stage 3: Consistency Scoring (Calculate semantic equivalence between original answer and all sampled answers)
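
The three stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `perturb`, `target`, `verifier`, and `agree` are hypothetical callables standing in for black-box LLM calls, the sample counts `k` and `n` are arbitrary defaults, and the toy stand-ins hard-code the answers from the pi example.

```python
from typing import Callable, Dict, List


def inconsistency(s0: str, answers: List[str],
                  agree: Callable[[str, str], bool]) -> float:
    """Fraction of sampled answers that disagree with the original answer s0."""
    if not answers:
        return 0.0
    return sum(not agree(s0, a) for a in answers) / len(answers)


def sac3_pipeline(x0: str, s0: str,
                  perturb: Callable[[str, int], List[str]],
                  target: Callable[[str], str],
                  verifier: Callable[[str], str],
                  agree: Callable[[str, str], bool],
                  k: int = 3, n: int = 3) -> Dict[str, float]:
    # Stage 1: k semantically equivalent rephrasings of the original question.
    variants = perturb(x0, k)
    # Stage 2: sample answers from both models on original and perturbed questions.
    samples = {
        "self": [target(x0) for _ in range(n)],
        "cross_question": [target(q) for q in variants],
        "cross_model": [verifier(x0) for _ in range(n)],
        "cross_question_model": [verifier(q) for q in variants],
    }
    # Stage 3: compare every sampled answer against the original answer s0.
    return {name: inconsistency(s0, answers, agree)
            for name, answers in samples.items()}


# Toy stand-ins (assumptions): the target is fixated on "No" for one phrasing only.
def toy_perturb(x0: str, k: int) -> List[str]:
    return ["Is 3.2 greater than pi?"] * k


def toy_target(q: str) -> str:
    return "No" if q == "Is pi smaller than 3.2?" else "Yes"


def toy_verifier(q: str) -> str:
    return "Yes"


def toy_agree(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


scores = sac3_pipeline("Is pi smaller than 3.2?", "No",
                       toy_perturb, toy_target, toy_verifier, toy_agree)
print(scores)
```

In this toy run the self-consistency check sees no disagreement, while the cross-question and cross-model checks both disagree with the original answer, which is the behavior the pi example describes.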

System Modules

Question Perturber

Generate and validate semantically equivalent variations of the input question

Model or implementation: Target LLM (e.g., gpt-3.5-turbo)

Target/Verifier Sampler

Generate multiple stochastic responses for the original and perturbed questions

Model or implementation: Target T (e.g., GPT-3.5) and Verifier V (e.g., Falcon-7b)

Consistency Scorer

Compute semantic equivalence between the original answer and sampled answers to derive a final score

Model or implementation: Target LLM (used as a judge)
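
A hedged sketch of how an LLM judge could be wrapped into a scorer. The prompt wording, the `judge` callable, and the Yes/No parsing rule are all assumptions made for illustration; the summary does not specify the prompts the paper uses.

```python
from typing import Callable, List, Tuple


def build_judge_prompt(q0: str, a0: str, q: str, a: str) -> str:
    # Hypothetical prompt format; not taken from the paper.
    return (
        f"Q1: {q0}\nA1: {a0}\nQ2: {q}\nA2: {a}\n"
        "Are these two question-answer pairs semantically consistent? "
        "Reply with Yes or No."
    )


def parse_verdict(reply: str) -> bool:
    """Treat any reply starting with 'yes' (case-insensitive) as 'consistent'."""
    return reply.strip().lower().startswith("yes")


def inconsistency_rate(q0: str, a0: str,
                       pairs: List[Tuple[str, str]],
                       judge: Callable[[str], str]) -> float:
    """Fraction of sampled (question, answer) pairs judged inconsistent with (q0, a0)."""
    if not pairs:
        raise ValueError("need at least one sampled pair")
    flagged = sum(
        not parse_verdict(judge(build_judge_prompt(q0, a0, q, a)))
        for q, a in pairs
    )
    return flagged / len(pairs)


# Fake judge (assumption): calls a pair consistent only if the second answer is "No".
def fake_judge(prompt: str) -> str:
    return "Yes" if "A2: No" in prompt else "No"


pairs = [
    ("Is 3.2 greater than pi?", "Yes"),
    ("Is pi less than 3.2?", "Yes"),
    ("Is pi smaller than 3.2?", "No"),
]
rate = inconsistency_rate("Is pi smaller than 3.2?", "No", pairs, fake_judge)
print(rate)
```

A higher rate means more sampled answers contradict the original answer, so the original is more likely to be a hallucination.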

Novel Architectural Elements

Cross-question consistency matrix: Evaluating consistency across a matrix of perturbed inputs rather than just repeated sampling of one input
Integrated Target-Verifier scoring: A weighted combination formula (Equation 8) merging self-consistency, cross-question, and cross-model scores
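
The summary does not reproduce Equation 8, so the sketch below is only an assumption about what a weighted combination could look like: a normalized weighted sum of three per-check scores. The function name, the weights, and the normalization are hypothetical and may differ from the paper's formula.

```python
def combine_scores(z_self: float, z_question: float, z_model: float,
                   w_self: float = 1.0, w_question: float = 1.0,
                   w_model: float = 1.0) -> float:
    """Normalized weighted sum of three inconsistency scores, each in [0, 1]."""
    total = w_self + w_question + w_model
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_self * z_self + w_question * z_question + w_model * z_model) / total


# Pi example: self-consistency sees no problem, the two cross-checks both do.
combined = combine_scores(z_self=0.0, z_question=1.0, z_model=1.0)
print(round(combined, 3))
```

Setting `w_model=0.0` drops the verifier term, which gives a target-only variant for cases where no second model is available.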

Modeling

Base Model: gpt-3.5-turbo (Target)

Compute: Inference-only. For SAC3-all: 1 Target call (generation) + 2*ns Verifier calls (if sample size = perturbation size). Costs scale with number of perturbations k.

Reproducibility

Code: https://github.com/intuit/sac3

📊 Experiments & Results

Evaluation Setup

Detecting hallucinations in classification QA and open-domain generation QA using black-box LMs.

Benchmarks:

Prime Number (Binary Classification QA)
Senator Search (Binary Classification QA)
HotpotQA-halu (Open-domain Generation QA, hallucination induced) [New]
NQ-open-halu (Open-domain Generation QA, hallucination induced) [New]

Metrics:

AUROC (Area Under ROC)
Accuracy (at threshold 0.5)
Statistical methodology: Not explicitly reported in the paper
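
For readers unfamiliar with the two metrics, here is a small self-contained sketch of how they can be computed from hallucination scores. It assumes label 1 means hallucinated and that a score strictly above the threshold is predicted as hallucinated; both conventions are assumptions, not details taken from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random hallucinated item (label 1) scores higher
    than a random factual item (label 0); ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at_threshold(scores: Sequence[float], labels: Sequence[int],
                          threshold: float = 0.5) -> float:
    """Share of items whose thresholded prediction matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up scores for illustration only.
example_scores = [0.9, 0.8, 0.4, 0.1]
example_labels = [1, 0, 1, 0]
print(auroc(example_scores, example_labels))                  # 0.75
print(accuracy_at_threshold(example_scores, example_labels))  # 0.5
```

AUROC is undefined when only one class is present, which is why a dataset made up entirely of hallucinated answers has to be evaluated with accuracy at a fixed threshold instead.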

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Classification QA results (Balanced Dataset): SAC3-Q significantly outperforms Self-consistency (SC2) on synthetic tasks where models are confidently wrong.
Prime Number	AUROC	65.9	99.4	+33.5
Senator Search	AUROC	56.1	99.7	+43.6
Classification QA results (100% Hallucinated Dataset): Evaluation on accuracy with a fixed threshold shows SAC3's robustness when the model is always wrong.
Prime Number	Accuracy	48.2	99.4	+51.2
Senator Search	Accuracy	29.6	97.0	+67.4
Open-domain Generation QA: SAC3 improves detection on realistic QA datasets, though margins are smaller than in synthetic tasks.
HotpotQA-halu	AUROC	74.2	88.0	+13.8
NQ-open-halu	AUROC	70.5	77.2	+6.7
Model Generalization: SAC3-Q maintains superiority over Self-consistency across GPT-4 and PaLM 2.
Senator Search	Accuracy	18.4	61.6	+43.2
HotpotQA-halu	Accuracy	75.8	82.8	+7.0

Experiment Figures

Histograms of hallucination scores for SC2 vs SAC3-Q on the Senator Search dataset.

Main Takeaways

Self-consistency is insufficient for factuality because models can be 'consistently wrong' (e.g., Prime Number task) or 'inconsistently right' (Senator Search task).
Question perturbations (SAC3-Q) are highly effective at breaking model fixation, causing the model to reveal inconsistency when the prompt is rephrased.
Cross-model checking (SAC3-M) helps when the target model has a strong specific bias, but relies on the verifier model being competent (e.g., Falcon-7b struggled on Senator Search).
Performance gains plateau after about 5 perturbed questions, suggesting a reasonable cost-accuracy trade-off is possible.

📚 Prerequisite Knowledge

Prerequisites

Self-consistency (sampling multiple reasoning paths)
Black-box LLM access (API-based interaction)
AUROC (Area Under the Receiver Operating Characteristic curve)

Key Terms

Self-consistency: The assumption that correct answers from an LLM should be consistent across multiple stochastic samples; inconsistency is treated as a signal of hallucination

Question-level hallucination: When an LLM consistently generates incorrect answers for a specific question regardless of sampling temperature

Model-level hallucination: Hallucinations arising from specific biases or limitations of a single model architecture, detectable by comparing against a different model

Semantically equivalent perturbation: Rephrasing a question so that it differs lexically but keeps the same meaning and the same correct answer (e.g., 'Is pi smaller than 3.2?' vs 'Is 3.2 greater than pi?')

Verifier LM: An additional, potentially smaller or different LLM used to cross-check the target model's responses

AUROC: Area Under the Receiver Operating Characteristic curve, a metric that summarizes binary classification performance across all decision thresholds