P(True): The probability a model assigns to the option 'True' when asked if a specific sample answer is correct
P(IK): The probability that 'I know', i.e., the probability a model assigns to the proposition that it will answer a given question correctly
Calibration: The alignment between a model's predicted probabilities and the actual frequency of correctness (e.g., if a model predicts 70% confidence, it should be right 70% of the time)
ECE: Expected Calibration Error, a metric that summarizes calibration by grouping predictions into confidence bins and taking the average, weighted by bin size, of the absolute difference between each bin's mean predicted probability and its actual accuracy
AUROC: Area Under the Receiver Operating Characteristic curve, a metric measuring how well a model's scores separate two classes (e.g., correct vs. incorrect answers) across all decision thresholds; it depends only on the ranking of the scores, not on whether they are calibrated
RLHF: Reinforcement Learning from Human Feedback—a method to fine-tune language models using human preferences
Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer
Brier Score: A proper scoring rule equal to the mean squared difference between predicted probabilities and the actual binary outcomes; lower scores are better and reflect both calibration and discrimination
OOD: Out-of-Distribution—evaluating a model on data types or tasks it was not explicitly trained on
Value Head: An additional neural network layer added to a language model to predict a scalar value (like P(IK)) rather than the next token
Unit temperature: Sampling with temperature T=1, meaning tokens are drawn in proportion to the model's predicted probabilities, with no sharpening (T<1) or flattening (T>1) of the distribution
None of the above: A multiple-choice option explicitly tested in the paper; including it was found to degrade model performance and calibration
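
The three evaluation metrics defined above (ECE, Brier Score, AUROC) can be made concrete with a short sketch. This is a minimal illustration using only the Python standard library, not the paper's evaluation code: the function names, the choice of 10 equal-width bins, and the sample values are all assumptions made for the example. Each function takes predicted probabilities (such as P(True) or P(IK) values) and 0/1 labels marking whether the answer was actually correct.

```python
from typing import Sequence


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def auroc(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random correct example outscores a random incorrect one.

    Ties count as one half. This pairwise form is O(n^2) and meant for clarity only.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect example")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up illustrative values: model confidence and whether each answer was correct.
sample_probs = [0.95, 0.85, 0.65, 0.35]
sample_labels = [1, 1, 0, 0]

print("ECE:  ", expected_calibration_error(sample_probs, sample_labels))
print("Brier:", brier_score(sample_probs, sample_labels))
print("AUROC:", auroc(sample_probs, sample_labels))
```

In this toy sample every correct answer is scored above every incorrect one, so discrimination is perfect even though the confidences themselves are poorly calibrated. That contrast is why AUROC and ECE are reported separately.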