residual stream: The running hidden-state vector at each token position in a Transformer model; every layer reads from it and adds its output back to it through residual connections
activation probing: Training a classifier (probe) on the internal activations of a frozen LLM to predict properties like truthfulness
DoLa: Decoding by Contrasting Layers, a decoding-time mitigation method that contrasts the next-token probabilities obtained from the final layer with those obtained from intermediate layers
greedy decoding: Generating text by always selecting the highest-probability token at each step, with no sampling
semantic entropy: A measure of uncertainty computed over the meanings of multiple sampled responses (responses that say the same thing are grouped together) rather than over token probabilities alone
fine-grained detection: The ability to distinguish between hallucinated and correct responses for the same prompt within the sampled response space
AUC: Area Under the ROC Curve, a metric that summarizes a binary classifier's performance across all decision thresholds; 0.5 corresponds to chance and 1.0 to perfect separation
OOD: Out-of-Distribution, describing data from a different domain than the training data; OOD evaluation tests how well a model generalizes to such data
logit: A raw, unnormalized prediction score produced by the last layer of a neural network before softmax is applied
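
To make the relationship between the logit and greedy decoding entries concrete, here is a minimal Python sketch. It is an illustration only, not code from the source: the four-token vocabulary, the logit values, and the helper names `softmax` and `greedy_pick` are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Return the index of the highest-probability token."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits over a toy four-token vocabulary (assumed values).
vocab = ["Paris", "London", "Rome", "Berlin"]
logits = [3.2, 1.1, 0.3, -0.5]

probs = softmax(logits)
choice = vocab[greedy_pick(logits)]
print(choice)  # prints "Paris"
```

Because softmax preserves the ordering of its inputs, taking the argmax of the logits directly would select the same token; the normalization step is shown only to connect the two definitions.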