RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs using a reward model trained on human preferences
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only low-rank matrices added to the model weights
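A minimal NumPy sketch of the low-rank update described above; the dimensions, scaling factor, and initialization are illustrative, not taken from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and low rank (r << d); illustrative values

W = rng.normal(size=(d, d))          # frozen pretrained weight (not updated)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x, alpha=16.0):
    # Effective weight is the frozen W plus a scaled low-rank update B @ A.
    return x @ (W + (alpha / r) * (B @ A)).T

full_params = W.size           # fine-tuning W directly: d*d parameters
lora_params = A.size + B.size  # LoRA trains only 2*r*d parameters
```

Because `B` starts at zero, the adapted model initially matches the frozen model exactly; training then moves only `A` and `B`.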
Overoptimization: When maximizing a proxy reward model's score leads to a decrease in the true underlying objective (human preference)
Nuclear Norm: The sum of the singular values of a matrix, used here as a convex surrogate for matrix rank to measure and encourage diversity
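A small NumPy sketch of the definition above: the nuclear norm is the sum of a matrix's singular values. The embedding matrix here is hypothetical; rows pointing in more distinct directions yield a larger nuclear norm:

```python
import numpy as np

# Hypothetical 2-D embeddings of four responses, one per row.
X = np.array([
    [1.0,  0.0],
    [0.0,  1.0],
    [1.0,  1.0],
    [1.0, -1.0],
])
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows

# Nuclear norm = sum of singular values, a convex surrogate for rank.
singular_values = np.linalg.svd(X, compute_uv=False)
nuclear_norm = np.linalg.norm(X, ord="nuc")
```

Collapsing all rows onto one direction would drive every singular value but the first to zero, shrinking the nuclear norm toward its rank-1 minimum.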
ECE: Expected Calibration Error—a metric measuring the gap between a model's predicted confidence and its actual accuracy, averaged over confidence bins
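A compact sketch of the standard binned ECE computation; the function name, bin count, and sample data are illustrative, not from the source:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average, over confidence bins, of |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its sample share
    return ece
```

A model that says "75% confident" and is right 75% of the time scores 0; a model that says "90%" but is always wrong scores 0.9.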
OOD: Out-of-Distribution—data samples that are significantly different from the training data, where models often make high-confidence errors
Gold Reward: The score from a superior, larger reward model used as a ground-truth proxy for evaluation
KL Divergence: Kullback–Leibler Divergence—a measure of how one probability distribution differs from a reference distribution, used here as a penalty to keep the tuned model close to the original
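A minimal sketch of KL divergence for discrete distributions, as applied per-token between a tuned policy and its reference model; the function and example distributions are illustrative:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); zero iff p == q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

policy    = np.array([0.7, 0.2, 0.1])  # tuned model's token distribution
reference = np.array([0.5, 0.3, 0.2])  # original model's distribution
penalty = kl_divergence(policy, reference)
```

In RLHF training this quantity is subtracted (with a coefficient) from the reward, so the policy is penalized for drifting far from the reference distribution.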