spurious correlations: Statistical associations between variables that do not imply causation (e.g., surname predicting nationality) but are picked up by models as shortcuts
hallucination: Confident generation of incorrect or non-existent information by an LLM
refusal fine-tuning: Training a model to output a fixed refusal response (e.g., 'I don't know') when it lacks the knowledge to answer or is uncertain
linear probing: A method to inspect internal model representations by training a simple linear classifier on hidden states to predict truthfulness
logit entropy: The Shannon entropy of the model's predicted probability distribution over the next token, used as an uncertainty measure; high entropy usually indicates that the model is uncertain
self-consistency: A detection method that checks whether a model produces the same answer across multiple sampled generations; high agreement usually implies higher confidence
kernel ridge regression: A statistical learning method used in the theoretical analysis to model how neural networks generalize versus memorize data
Jaccard similarity: A metric used here to measure the co-occurrence of entities in texts, serving as a proxy for the strength of spurious correlations in real-world data
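
Three of the measures defined above (logit entropy, self-consistency, and Jaccard similarity) are simple enough to sketch in a few lines. The block below is a minimal illustration, not the implementation used in the analysis. The function names are hypothetical, entropy is reported in nats, answers are compared after lowercasing and stripping whitespace (an assumption; real pipelines often need fuzzier matching), and entity co-occurrence is assumed to be computed over the sets of document IDs that mention each entity.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over next-token logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def self_consistency(answers):
    """Return the most common answer and the fraction of samples agreeing with it.

    Assumption: answers are matched after lowercasing and stripping whitespace.
    """
    counts = Counter(a.strip().lower() for a in answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)


def jaccard_similarity(docs_a, docs_b):
    """|A ∩ B| / |A ∪ B| for the sets of document IDs mentioning each entity."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Uniform logits give maximal entropy; one dominant logit gives entropy near zero.
    print(logit_entropy([0.0, 0.0, 0.0, 0.0]))
    print(logit_entropy([10.0, 0.0, 0.0, 0.0]))
    # Two of three sampled answers agree after normalization.
    print(self_consistency(["Paris", "paris ", "Lyon"]))
    # Two shared documents out of four distinct documents.
    print(jaccard_similarity([1, 2, 3], [2, 3, 4]))
```

A uniform distribution over four tokens yields entropy ln 4, the maximum for that vocabulary size, while a sharply peaked distribution yields a value close to zero. The agreement fraction returned by `self_consistency` plays the same role from the sampling side: values near 1 suggest the model is confident, values near 1/k for k samples suggest it is not.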