verbalized probability: Uncertainty expressed by the model in natural-language words or numbers (e.g., 'High confidence', '90%') rather than extracted from its logits or other internal quantities
logits: The raw, unnormalized scores output by the last layer of a neural network before applying softmax to get probabilities
epistemic uncertainty: Uncertainty about whether a claim is true, i.e., about the model's own knowledge, rather than uncertainty about which token comes next (token-level uncertainty)
CalibratedMath: A suite of arithmetic tasks introduced in this paper to test calibration generalization across different difficulty levels and question formats
MAD: Mean Absolute Deviation, a calibration metric that groups answers into bins and averages, across bins, the absolute difference between the model's mean predicted confidence and its actual accuracy in each bin
MSE: Mean Squared Error, a proper scoring rule (equivalent to the Brier score) that averages the squared difference between the predicted probability and the binary ground truth (1 if correct, 0 if incorrect)
greedy decoding: A generation strategy where the model always selects the highest-probability token at each step
Expected Value decoding: A decoding method used in the few-shot experiments in which the output is the sum of the values of the top-5 candidate tokens, each weighted by its probability, which yields a continuous confidence estimate instead of a single decoded token
answer logit: The normalized log-probability assigned by the model to the tokens constituting its answer
indirect logit: The log-probability of the token 'True' when the model is shown its own generated answer and asked whether it is true or false
linear probe: A simple linear classifier trained on top of a frozen model's internal representations (embeddings) to predict a target label
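
The metric and decoding entries above (logits, MAD, MSE, Expected Value decoding) can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, MAD is assumed to use equal-count bins formed by sorting on confidence, and Expected Value decoding is assumed to renormalize the probabilities of the top candidate tokens before averaging.

```python
import math


def softmax(logits):
    """Turn raw, unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(confidences, correct):
    """Mean squared error (Brier score) between confidence and 0/1 correctness."""
    if not confidences:
        raise ValueError("need at least one prediction")
    return sum((p - float(c)) ** 2 for p, c in zip(confidences, correct)) / len(confidences)


def mad(confidences, correct, n_bins=10):
    """Mean over bins of |mean confidence - accuracy|.

    Assumption: bins have (nearly) equal counts and are formed by sorting
    the answers by confidence.
    """
    if not confidences:
        raise ValueError("need at least one prediction")
    pairs = sorted(zip(confidences, correct), key=lambda pc: pc[0])
    n = len(pairs)
    n_bins = min(n_bins, n)  # guarantees every bin is non-empty
    deviations = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        conf = sum(p for p, _ in chunk) / len(chunk)
        acc = sum(1.0 for _, c in chunk if c) / len(chunk)
        deviations.append(abs(conf - acc))
    return sum(deviations) / len(deviations)


def expected_value_decode(top_tokens):
    """Probability-weighted average of candidate token values.

    top_tokens: list of (numeric_value, log_probability) pairs, e.g. the
    top-5 candidates. Assumption: probabilities are renormalized over the
    candidates supplied (softmax over log-probabilities does exactly that).
    """
    probs = softmax([lp for _, lp in top_tokens])
    return sum(value * p for (value, _), p in zip(top_tokens, probs))


if __name__ == "__main__":
    conf = [0.9, 0.9, 0.1, 0.1]
    right = [True, True, False, False]
    print("MSE:", mse(conf, right))
    print("MAD:", mad(conf, right, n_bins=2))
    print("EV :", expected_value_decode([(90, math.log(0.5)), (50, math.log(0.5))]))
```

Note the difference in granularity: MSE scores each answer individually, so it rewards confident correct answers, while MAD only compares averages within bins, so a model that outputs its overall accuracy for every question can score well on MAD without distinguishing easy from hard questions.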