SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs

📝 Paper Summary

Hallucination suppression, Uncertainty quantification

SINdex detects hallucinations by clustering multiple LLM responses based on semantic embeddings and calculating a novel inconsistency score that penalizes both cluster dispersion and lack of internal coherence.

Core Problem

Existing semantic entropy methods for hallucination detection rely on computationally expensive NLI models that often struggle with nuanced semantic similarity, leading to inaccurate uncertainty estimation.

Why it matters:

LLMs frequently generate plausible but factually incorrect information (hallucinations), posing risks in sensitive domains like medicine and law.
Current NLI-based methods are slow and computationally intensive, making them difficult to scale for real-time applications.
Strict entailment checks in NLI can fail to group semantically similar but syntactically distinct sentences (e.g., hedged vs. direct assertions of the same answer), leading to fragmented clusters and false uncertainty signals.

Concrete Example: For the question 'Which is the third planet from the sun?', responses like 'It is Earth' and 'I think it could be Earth' should be clustered together. NLI-based methods might separate them because 'I think...' does not strictly entail 'It is...', whereas SINdex's embedding-based clustering correctly groups them, recognizing the shared semantic core.

Key Novelty

SINdex (Semantic INconsistency Index)

Replaces NLI-based entailment checks with sentence embeddings and hierarchical agglomerative clustering to group responses based on semantic similarity.
Introduces a new inconsistency measure (SINdex) that adjusts standard entropy by weighing it against intra-cluster coherence (cosine similarity), ensuring that tight, consistent clusters yield lower uncertainty scores than loose, vague ones.
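
One way to write a score with these properties is sketched below. It is an assumed form that matches the description above, not a transcription of the paper's Equation 10. The symbols are illustrative: C_1, ..., C_k are the clusters over P responses, e_a is the embedding of response a, p_i is the share of responses in cluster i, and \bar{s}_i is the mean pairwise cosine similarity inside cluster i (taken as 1 for a singleton cluster).

```latex
p_i = \frac{|C_i|}{P}, \qquad
\bar{s}_i = \frac{2}{|C_i|\,(|C_i|-1)} \sum_{\substack{a,b \in C_i \\ a<b}} \cos(e_a, e_b), \qquad
p_i' = p_i \, \bar{s}_i, \qquad
\mathrm{SINdex} = -\sum_{i=1}^{k} p_i' \log p_i'
```

Under this form, a single perfectly coherent cluster (p_1 = 1, \bar{s}_1 = 1) scores 0, and the expression reduces to ordinary entropy over cluster proportions when every \bar{s}_i equals 1.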

Architecture

Overview of the SINdex framework.

Evaluation Highlights

Achieves up to 9.3% relative improvement in AUROC for hallucination detection compared to state-of-the-art Semantic Entropy methods on benchmarks like TriviaQA and BioASQ; the largest gain (0.741 to 0.810) is on BioASQ.
Demonstrates a 60-fold speedup in processing time compared to NLI-based approaches when analyzing 200 generations, due to efficient embedding clustering.
Consistent performance gains across both open-book (SQuAD) and closed-book (TriviaQA, NQ) QA datasets using Llama-2-7b-chat.

Breakthrough Assessment

7/10

Significant efficiency gains (60x) and solid performance improvements make this a practical contribution. It simplifies the semantic entropy pipeline by removing the heavy NLI dependency.

⚙️ Technical Details

Problem Definition

Setting: Black-box hallucination detection for Question Answering tasks using multiple stochastic samples.

Inputs: A question q and a set of P stochastically generated responses G = {g_1, ..., g_P} from an LLM.

Outputs: A scalar score (SINdex) representing the semantic inconsistency (uncertainty) of the model's responses.

Pipeline Flow

Generation: Prompt LLM P times to get responses
Embedding: Compute sentence embeddings for all responses
Clustering: Group embeddings using Hierarchical Agglomerative Clustering
Scoring: Calculate SINdex based on cluster proportions and internal coherence
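
The four stages above can be sketched as a function that wires together pluggable components. This is a minimal skeleton under stated assumptions: the names `inconsistency_pipeline`, `sample`, `embed`, `cluster`, and `score` are hypothetical and are not taken from the paper, and each callable stands in for a real component (an LLM sampler, a sentence-embedding model, a clusterer, and a scoring function).

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def inconsistency_pipeline(
    question: str,
    sample: Callable[[str], str],
    embed: Callable[[str], Vector],
    cluster: Callable[[List[Vector]], List[int]],
    score: Callable[[List[Vector], List[int]], float],
    num_samples: int,
) -> float:
    """Run Generation -> Embedding -> Clustering -> Scoring for one question."""
    # Generation: draw num_samples stochastic responses from the LLM.
    responses = [sample(question) for _ in range(num_samples)]
    # Embedding: map each response to a dense vector.
    embeddings = [embed(r) for r in responses]
    # Clustering: assign one cluster label per response.
    labels = cluster(embeddings)
    # Scoring: turn the clustering into a single inconsistency value.
    return score(embeddings, labels)
```

Keeping the stages as injected callables makes it easy to swap the embedding model or the clustering threshold without touching the rest of the flow.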

System Modules

Response Generator

Generate multiple stochastic answers for the input question

Model or implementation: Llama-2-7b-chat-hf (or other LLMs)

Embedder

Convert question-answer pairs into dense vector representations

Model or implementation: Transformer-based sentence similarity model (e.g., all-MiniLM-L6-v2)

Clusterer

Group semantically similar responses

Model or implementation: Hierarchical Agglomerative Clustering (Average Linkage)
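
To make the clustering step concrete, here is a small pure-Python sketch of average-linkage agglomerative clustering with a distance threshold. It assumes cosine distance between embeddings and non-zero vectors; the summary does not confirm the distance metric, and the function names are hypothetical. The naive search over all cluster pairs is for illustration only; a library implementation would be used in practice.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage_clusters(
    embeddings: List[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Merge the closest pair of clusters until no pair is closer than threshold."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in a for j in b
        )
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d >= threshold:  # closest pair is too far apart: stop merging
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy 2-D "embeddings": the first two point in nearly the same direction.
toy = [[1.0, 0.0], [0.999, 0.02], [0.0, 1.0]]
toy_clusters = average_linkage_clusters(toy, threshold=0.05)
```

With the toy vectors, the first two responses merge and the third stays on its own.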

SINdex Calculator

Compute the final inconsistency score

Model or implementation: Mathematical formula (Equation 10)
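
The sketch below shows one way to compute a coherence-adjusted entropy over clusters. It is an assumption based on the description in this summary (cluster proportions weighted by mean intra-cluster cosine similarity), not a verified copy of Equation 10; the function names, and the choice to treat singleton clusters as having coherence 1, are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_coherence(embeddings: List[Sequence[float]], members: List[int]) -> float:
    """Mean pairwise cosine similarity inside one cluster (1.0 for a singleton)."""
    if len(members) < 2:
        return 1.0
    pairs = list(combinations(members, 2))
    return sum(
        cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs
    ) / len(pairs)


def inconsistency_score(
    embeddings: List[Sequence[float]], clusters: List[List[int]]
) -> float:
    """Entropy over cluster proportions, each scaled by its cluster's coherence."""
    total = sum(len(members) for members in clusters)
    score = 0.0
    for members in clusters:
        adjusted = (len(members) / total) * cluster_coherence(embeddings, members)
        if adjusted > 0.0:  # skip terms where the log is undefined
            score -= adjusted * math.log(adjusted)
    return score


tight = inconsistency_score([[1.0, 0.0], [1.0, 0.0]], [[0, 1]])
loose = inconsistency_score([[1.0, 0.0], [0.8, 0.6]], [[0, 1]])
split = inconsistency_score([[1.0, 0.0], [0.0, 1.0]], [[0], [1]])
```

In this toy setup a single tight cluster scores zero, a single looser cluster scores above zero, and two separate clusters score highest, which matches the qualitative behavior described for SINdex.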

Novel Architectural Elements

Replacement of NLI-based graph connected components with continuous space Hierarchical Agglomerative Clustering for response grouping.
Integration of intra-cluster coherence (cosine similarity) directly into the entropy calculation (SINdex) to weigh the 'quality' of semantic clusters.

Modeling

Base Model: Llama-2-7b-chat-hf (used for generating responses to be analyzed)

Compute: Evaluation only. Scalability experiments run on A100 GPU.

Comparison to Prior Work

vs. Semantic Entropy: SINdex uses embeddings + clustering instead of NLI, making it about 60x faster at 200 generations and robust to syntactic variations that NLI misses.
vs. Naive Entropy: SINdex operates on semantic clusters rather than token distributions, capturing meaning-level uncertainty.
vs. Lexical Similarity: SINdex captures semantic equivalence even when wording is entirely different, which lexical metrics fail to do.
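
The weakness of lexical metrics can be illustrated with a simple token-overlap score. The Jaccard measure below is a stand-in chosen for brevity, not necessarily the lexical baseline used in the paper, and the example sentences are made up for illustration.

```python
def jaccard_overlap(a: str, b: str) -> float:
    """Share of lowercase whitespace tokens that two strings have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same meaning, no shared words.
paraphrase = jaccard_overlap("It is Earth", "Our home planet")
# Different answers, mostly shared words.
contradiction = jaccard_overlap("It is Earth", "It is Mars")
```

Here the paraphrase pair scores lower than the contradicting pair, which is the opposite of what a meaning-level measure should report.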

Limitations

Relies on the quality of the sentence embedding model; poor embeddings will lead to poor clustering.
The clustering distance threshold (0.05) is a hyperparameter that may need tuning for different domains.
Does not fix the hallucinations, only detects them.
Performance depends on the number of generations (P), though the paper reports it to be robust to this choice.

Reproducibility

Code is not provided in the paper. The method relies on standard libraries (likely scikit-learn for clustering, HuggingFace for embeddings). Hyperparameters like distance threshold (0.05) and linkage (average) are specified.

📊 Experiments & Results

Evaluation Setup

QA tasks where the model generates multiple answers, and the goal is to predict if the generation is a hallucination (incorrect answer).

Benchmarks:

TriviaQA (Open-domain QA)
SQuAD (Reading Comprehension QA)
BioASQ (Biomedical QA)
Natural Questions (NQ) (Open-domain QA)

Metrics:

AUROC (Area Under the Receiver Operating Characteristic)
Statistical methodology: Not explicitly reported in the paper
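
For readers unfamiliar with the metric, AUROC can be computed directly as the probability that a randomly chosen hallucinated example receives a higher uncertainty score than a randomly chosen correct one, with ties counted as half. The function below is a minimal sketch of that definition; the name `auroc` and the label convention (1 = hallucination, 0 = correct) are assumptions for illustration.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Uncertainty scores for four questions; 1 marks a hallucinated answer.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of hallucinated from correct answers.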

Key Results

SINdex consistently outperforms baselines in detecting hallucinations across various datasets, measured by AUROC.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	AUROC	0.835	0.854	+0.019
SQuAD	AUROC	0.782	0.801	+0.019
BioASQ	AUROC	0.741	0.810	+0.069
Natural Questions (NQ)	AUROC	0.765	0.793	+0.028

Scalability analysis shows dramatic speed improvements over NLI-based methods.

Setting	Metric	Baseline	This Paper	Δ
Runtime Analysis (200 generations)	Runtime (seconds)	300	5	-295
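
The headline figures can be checked against the table with a few lines of arithmetic. The snippet below uses only the numbers listed above; the variable and function names are illustrative.

```python
# (baseline AUROC, SINdex AUROC) per benchmark, copied from the table.
auroc_results = {
    "TriviaQA": (0.835, 0.854),
    "SQuAD": (0.782, 0.801),
    "BioASQ": (0.741, 0.810),
    "Natural Questions (NQ)": (0.765, 0.793),
}


def relative_gain_percent(baseline: float, new: float) -> float:
    """Improvement of `new` over `baseline`, as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100.0


gains = {
    name: round(relative_gain_percent(base, new), 1)
    for name, (base, new) in auroc_results.items()
}
best_benchmark = max(gains, key=gains.get)

# Runtime at 200 generations: 300 s for the NLI-based method vs. 5 s.
speedup = 300 / 5
```

The largest relative gain is on BioASQ, at about 9.3%, and the runtime ratio is 60, so the "up to 9.3%" and "60-fold" claims both follow from the reported values.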

Experiment Figures

Comparison of clustering quality between Hierarchical Agglomerative Clustering (used by SINdex) and Bidirectional NLI Clustering.

Scalability analysis comparing runtime of Clustering vs NLI-based methods as the number of generations increases.

Main Takeaways

SINdex provides superior hallucination detection performance compared to Semantic Entropy across all tested benchmarks.
The method is about 60x faster than NLI-based approaches at 200 generations (5 s vs. 300 s), enabling scalability to large numbers of generated responses.
Ablation studies confirm that both the clustering method (Hierarchical vs NLI) and the inconsistency measure (SINdex vs Standard Entropy) contribute to the performance gains.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models and Hallucination
Basic knowledge of clustering algorithms (Hierarchical Agglomerative Clustering)
Familiarity with Semantic Entropy and NLI (Natural Language Inference)

Key Terms

SINdex: Semantic INconsistency Index—the proposed measure that combines cluster entropy with intra-cluster cosine similarity to quantify hallucination risk.

Semantic Entropy: A method to estimate uncertainty by grouping semantically equivalent answers and calculating the entropy over these meaning clusters.
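
As a point of comparison, the count-based variant of this idea can be sketched in a few lines: treat each response's cluster label as a draw from a categorical distribution and take the entropy of the empirical proportions. This is a simplified illustration; the original formulation of Semantic Entropy can also weight clusters by sequence likelihoods, which this sketch does not do, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cluster_entropy(labels: Sequence[Hashable]) -> float:
    """Entropy (in nats) of the empirical distribution over cluster labels."""
    counts = Counter(labels)
    total = len(labels)
    entropy = 0.0
    for count in counts.values():
        share = count / total
        entropy -= share * math.log(share)
    return entropy


agree = cluster_entropy(["earth", "earth", "earth", "earth"])
disagree = cluster_entropy(["earth", "earth", "mars", "venus"])
```

All responses in one meaning cluster give zero entropy, and the value grows as responses spread across more clusters.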

NLI: Natural Language Inference—determining if one sentence entails (implies) another. Often used in prior work to cluster answers.

Hierarchical Agglomerative Clustering: A bottom-up clustering method where each data point starts as its own cluster and pairs are merged iteratively based on similarity.

AUROC: Area Under the Receiver Operating Characteristic curve—a metric used to evaluate the performance of a binary classifier (detecting hallucination vs. correct).

BioASQ: A biomedical question answering dataset used as a benchmark.

TriviaQA: A question answering dataset of trivia questions paired with evidence documents; used here as a closed-book, open-domain QA benchmark.

SQuAD: Stanford Question Answering Dataset—a reading comprehension benchmark.

NQ: Natural Questions—a dataset of questions from Google search logs.