proxy model: A separate language model (e.g., LLaMA) used to evaluate the probabilities of text generated by a target black-box LLM (e.g., GPT-3)
hallucination score: A metric quantifying the likelihood that a generated token or sentence is hallucinated, typically based on low probability or high entropy
exposure bias: The mismatch between training and inference: during training a model conditions on ground-truth tokens, but during inference it conditions on its own previous (potentially erroneous) outputs
token IDF: Inverse Document Frequency, a measure of how rare a token is across a corpus; used here to normalize token probabilities so that rare but correct words are not mistaken for hallucinations
AUC-PR: Area Under the Precision-Recall Curve, a performance metric for binary classification that is well suited to imbalanced tasks such as hallucination detection
SelfCheckGPT: A baseline method that detects hallucinations by checking consistency across multiple sampled responses from the same LLM
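
The quantities named in this glossary can be made concrete with a small sketch. The code below is illustrative only and is not the method of any particular paper: it computes the two uncertainty signals mentioned under "hallucination score" (negative log-probability and entropy), a smoothed token IDF, and a step-wise AUC-PR (average precision). The function names, the smoothing formula log((N + 1) / (df + 1)), and the toy inputs are all assumptions made for this example.

```python
import math
from collections import Counter


def neg_log_prob(p_token):
    """Surprisal of the emitted token: large when the proxy model finds it unlikely."""
    return -math.log(p_token)


def entropy(dist):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def token_idf(documents):
    """Smoothed IDF per token: log((N + 1) / (df + 1)).

    `documents` is a list of token lists. The smoothing is an assumption
    chosen for this sketch, not a formula taken from the source.
    """
    n_docs = len(documents)
    # Count each token once per document it appears in.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((n_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    `labels` are 1 (hallucinated) or 0 (factual); `scores` are hallucination
    scores where higher means more likely hallucinated.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos if n_pos else 0.0


if __name__ == "__main__":
    # Toy values, invented for illustration.
    print(neg_log_prob(0.05))                   # unlikely token, high surprisal
    print(entropy([0.25, 0.25, 0.25, 0.25]))    # flat distribution, high entropy
    print(token_idf([["paris", "is"], ["is"]]))
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

In practice the token probabilities and distributions would come from the proxy model, and the labels from human annotation of hallucinated versus factual sentences. How the individual signals are combined into a single hallucination score is left open here.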