SINdex: Semantic INconsistency Index—the proposed measure that combines cluster entropy with intra-cluster cosine similarity to quantify hallucination risk.
Semantic Entropy: A method to estimate uncertainty by grouping semantically equivalent answers and calculating the entropy over these meaning clusters.
NLI: Natural Language Inference, the task of determining whether one sentence entails (implies), contradicts, or is neutral toward another. Often used in prior work to cluster answers by meaning.
Hierarchical Agglomerative Clustering: A bottom-up clustering method in which each data point starts as its own cluster and the most similar pair of clusters is merged at each step until a stopping criterion (such as a similarity threshold) is met.
AUROC: Area Under the Receiver Operating Characteristic curve, a threshold-independent metric for evaluating a binary classifier (here, distinguishing hallucinated answers from correct ones). A value of 0.5 corresponds to chance and 1.0 to perfect separation.
BioASQ: A biomedical question answering dataset used as a benchmark.
TriviaQA: A reading comprehension dataset containing trivia questions and evidence documents.
SQuAD: Stanford Question Answering Dataset—a reading comprehension benchmark.
NQ: Natural Questions—a dataset of questions from Google search logs.
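
To make two of the terms above concrete, here is a minimal sketch of plain Semantic Entropy and of AUROC, assuming the sampled answers have already been grouped into meaning clusters (for example by NLI or by clustering embeddings). The function names, the cluster IDs, and the toy scores are hypothetical and chosen only for illustration. This is not the SINdex formula, which is not spelled out in this glossary.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids holds one cluster label per sampled answer; answers that
    share a label are assumed to be semantically equivalent.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    labels use 1 for the positive class (here: hallucinated) and 0 for the
    negative class (here: correct). Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical cluster assignments for five sampled answers per question.
    consistent = [0, 0, 0, 0, 0]      # all answers share one meaning
    inconsistent = [0, 1, 2, 3, 3]    # answers disagree
    print(semantic_entropy(consistent), semantic_entropy(inconsistent))

    # Hypothetical uncertainty scores and hallucination labels.
    print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Higher entropy means the sampled answers spread over more meanings, which is why it can serve as a hallucination score. AUROC then measures how well that score ranks hallucinated answers above correct ones without fixing a threshold.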