hidden knowledge: Factual knowledge encoded in a model's parameters that its generated outputs do not express, i.e., knowledge that shows up in internal computations but not in external signals
external knowledge: Knowledge measured using observable signals like token-level probabilities (e.g., P(a|q))
internal knowledge: Knowledge measured using intermediate computations, such as hidden state representations accessed via a probe
K score: A metric quantifying knowledge as the fraction of (correct, incorrect) answer pairs where the correct one is ranked higher by a scoring function
K* score: A binary metric indicating 'perfect knowledge', where the model ranks every correct answer higher than every incorrect answer for a specific question
probing classifier: A simple linear model trained on LLM hidden states to predict properties (here, correctness) of the input
tip of the tongue: A cognitive state where a subject knows a fact but cannot retrieve or produce the word; applied here to LLMs knowing an answer but failing to generate it
greedy decoding: A generation strategy where the model always selects the highest-probability token at each step
LLM judge: Using a strong language model to evaluate the correctness of answers generated by another model
SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples
inference scaling: Improving model performance at test time by using more compute, typically via sampling multiple outputs and selecting the best one
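
The K score and K* score entries above can be made concrete with a short sketch. This is an illustration under stated assumptions, not the source's implementation: the names `k_score`, `k_star`, and `toy_score` are hypothetical, the scoring function is a stand-in for whatever signal is used (e.g., P(a|q) or a probe output), and ties are counted here as the correct answer not being ranked higher.

```python
from itertools import product
from typing import Callable, Sequence

# Hypothetical scoring-function type: maps (question, answer) to a real-valued score.
ScoreFn = Callable[[str, str], float]


def k_score(score: ScoreFn, question: str,
            correct: Sequence[str], incorrect: Sequence[str]) -> float:
    """Fraction of (correct, incorrect) answer pairs where the correct answer scores higher."""
    pairs = list(product(correct, incorrect))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    # Assumption: a tie does not count as the correct answer being ranked higher.
    wins = sum(score(question, a) > score(question, b) for a, b in pairs)
    return wins / len(pairs)


def k_star(score: ScoreFn, question: str,
           correct: Sequence[str], incorrect: Sequence[str]) -> int:
    """1 if every correct answer outranks every incorrect answer, else 0."""
    return int(k_score(score, question, correct, incorrect) == 1.0)


# Toy scores, made up purely for illustration.
toy_scores = {"Paris": 0.6, "Paris, France": 0.2, "Lyon": 0.3, "Berlin": 0.1}


def toy_score(question: str, answer: str) -> float:
    return toy_scores[answer]


question = "What is the capital of France?"
correct = ["Paris", "Paris, France"]
incorrect = ["Lyon", "Berlin"]

print(k_score(toy_score, question, correct, incorrect))  # 0.75 (3 of 4 pairs ordered correctly)
print(k_star(toy_score, question, correct, incorrect))   # 0 (one pair is misordered)
```

In this toy case "Paris, France" scores below "Lyon", so one of the four pairs is misordered: K is 0.75 and K* is 0. Swapping `toy_score` for a different scoring function, for example one based on token probabilities versus one based on a probe, would give K values that can be compared for the same question.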