DPO: Direct Preference Optimization—an algorithm for aligning language models to human preferences by solving for the optimal policy in closed form, avoiding a separate reward model.
ECE: Expected Calibration Error—a metric measuring the gap between a model's predicted confidence and its actual accuracy, typically computed by binning predictions by confidence and taking the sample-weighted average of |accuracy − confidence| over bins.
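The binned definition above can be sketched in a few lines; this is a minimal illustration, not a reference implementation (bin count and bin edges are conventional choices, and the function name is ours):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted average of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in this bin
            conf = confidences[mask].mean()  # mean stated confidence
            ece += mask.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated model (e.g. 90% confidence, 90% accuracy) scores 0; a model that always says 1.0 but is right half the time scores 0.5.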
Confidence Drift: The phenomenon where a model's probability estimates shift away from true correctness probabilities during training (e.g., becoming overconfident).
Logits: The raw, unnormalized scores output by the final layer of a neural network before applying softmax.
Temperature Scaling: A post-hoc calibration technique that divides logits by a scalar value T (typically fit on a held-out validation set) to adjust the entropy of the output distribution.
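As a quick sketch of the mechanic (function names are ours): dividing logits by T before the softmax leaves the argmax unchanged but flattens the distribution for T > 1 and sharpens it for T < 1.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def temperature_scale(logits, T):
    """Apply temperature T to logits; T > 1 raises entropy, T < 1 lowers it."""
    return softmax(np.asarray(logits, dtype=float) / T)
```

For example, scaling the logits [2, 1, 0] with T = 2 lowers the top-class probability relative to T = 1 while keeping the ranking of classes intact.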
RCFT: Regularized Calibration-Aware Finetuning—a baseline method that applies calibration as a separate supervised fine-tuning phase after alignment.
Probability Margin: The difference in predicted probability between the correct token and the highest-probability incorrect token.
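The margin definition is a one-liner in code; this helper (a hypothetical name, for illustration only) computes it from a probability vector and the index of the correct token:

```python
import numpy as np

def probability_margin(probs, correct_idx):
    """P(correct token) minus the highest probability among incorrect tokens."""
    probs = np.asarray(probs, dtype=float)
    incorrect = np.delete(probs, correct_idx)  # drop the correct token's entry
    return probs[correct_idx] - incorrect.max()
```

A positive margin means the model ranks the correct token first; a negative margin means some incorrect token outranks it.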
Bayes-optimal: A decision rule that minimizes the expected loss (or maximizes expected utility) given the true posterior probability distribution.