Hallucination: Content generated by an LLM that is nonsensical or unfaithful to the source or to reality
Zero-resource: Methods that do not require external databases, ground truth documents, or human labeling during inference
Black-box: Systems where only the text output is accessible, without access to internal weights, gradients, or token probabilities
Grey-box: Systems where the user has access to the output token probabilities (or logits) but not necessarily to the full weights
MQAG: Multiple-choice Question Answering and Generation, a framework used here to check whether the samples answer generated questions consistently
BERTScore: A metric usually used for text similarity; used here to measure whether a sentence is semantically present in the other samples
AUC-PR: Area Under the Precision-Recall Curve, a performance metric suited to imbalanced classification tasks such as error detection
WikiBio: A dataset of Wikipedia biographies used here to generate synthetic articles for evaluating hallucination
NLI: Natural Language Inference, the task of determining whether a premise entails, contradicts, or is neutral toward a hypothesis
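
To make the AUC-PR entry above concrete, here is a minimal sketch of one common estimator, average precision, which sums precision at each positive item in the ranked list. The function name, the toy labels, and the toy scores are illustrative assumptions; the source does not state which estimator or tooling it uses.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for the positive class (e.g. a hallucinated sentence), else 0.
    scores: higher means the detector considers the item more likely positive.
    Ties in scores are broken by input order in this sketch.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank items from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at this rank, weighted by the recall step 1/total_pos.
            ap += (hits / rank) / total_pos
    return ap


# Toy example (made-up values): 2 positives among 5 items.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.8333
```

A detector that ranks items at random scores close to the positive rate (0.4 in the toy data), which is why AUC-PR is usually read against that baseline on imbalanced data.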