RLHF: Reinforcement Learning from Human Feedback—a method to align language models by training a reward model on human preferences and optimizing the policy via RL
DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly on preference data, without training a separate reward model
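As a minimal sketch of the DPO objective for a single preference pair (the log-probability values and the `beta` setting below are illustrative, not from the source):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]),
    i.e. the implicit reward is the policy/reference log-ratio."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

# Loss is lower when the policy favors the chosen response
# more strongly (relative to the reference) than the rejected one.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5) < dpo_loss(-2.0, -1.0, -1.5, -1.5))
```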
Bradley-Terry model: A statistical model predicting the probability that one item is preferred over another based on their latent reward scores
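The Bradley-Terry preference probability is just a sigmoid of the reward difference, P(A ≻ B) = σ(r_A − r_B); a tiny sketch (the reward values are made up for illustration):

```python
import math

def bt_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry: probability that A is preferred over B,
    equal to sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))

print(bt_prob(1.0, 1.0))  # equal rewards -> 0.5
print(bt_prob(2.0, 0.0))  # higher reward -> preferred more often
```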
Overoptimization: Also known as reward hacking; when a model exploits the reward function to get a high score without actually improving generation quality
Verbosity bias: The tendency of language models and reward models to systematically favor longer responses regardless of their quality
LC-win rate: Length-Controlled win rate—an evaluation metric that adjusts for response length to prevent longer answers from automatically winning
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices
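A minimal numpy sketch of the LoRA update W′ = W + BA (the dimensions, rank, and zero-initialization of B are illustrative assumptions, not prescribed by the source):

```python
import numpy as np

d, r = 8, 2  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection (d -> r)
B = np.zeros((d, r))                 # trainable up-projection (r -> d), zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Frozen path plus low-rank update; only A and B receive gradients.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
# With B zero-initialized, the update is zero: output equals the frozen path.
print(np.allclose(lora_forward(x), x @ W.T))  # True
```

Because only A and B (2·d·r parameters) are trained instead of the full d×d matrix, fine-tuning touches a small fraction of the weights.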
MLE: Maximum Likelihood Estimation—a method for estimating the parameters of a statistical model
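As a worked example of MLE (the coin-flip data below is made up): for Bernoulli observations, the estimate that maximizes the likelihood is the sample mean, and a grid search over the log-likelihood confirms it.

```python
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0]  # toy data: 1 = heads
p_hat = sum(flips) / len(flips)   # closed-form Bernoulli MLE
print(p_hat)  # 0.625

def log_likelihood(p: float) -> float:
    """Log-likelihood of the observed flips under heads-probability p."""
    return sum(math.log(p) if f else math.log(1 - p) for f in flips)

# p_hat attains the highest log-likelihood on a fine grid of candidates.
best = max((0.01 * k for k in range(1, 100)), key=log_likelihood)
print(abs(best - p_hat) < 0.01)  # True
```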
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
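For discrete distributions, KL(p‖q) = Σᵢ pᵢ log(pᵢ/qᵢ); a short sketch (the example distributions are illustrative), which also shows that the measure is asymmetric:

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for discrete distributions: non-negative,
    zero iff p == q, and asymmetric in its arguments."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))                       # 0.0
print(kl_divergence(p, q) != kl_divergence(q, p))  # True: asymmetric
```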