RLHF: Reinforcement Learning from Human Feedback—a method to train language models using rewards learned from human preferences
KL divergence: A statistical measure of how one probability distribution differs from another, used here as a penalty that keeps the fine-tuned model from drifting too far from its base (pre-RLHF) distribution
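As a concrete illustration of this definition, here is a minimal sketch computing the KL divergence between two toy next-token distributions; the probability values are hypothetical and chosen only to show that KL is zero exactly when the distributions match and positive otherwise:

```python
import numpy as np

# Two toy next-token distributions over a 4-token vocabulary
# (hypothetical values, for illustration only).
p = np.array([0.50, 0.25, 0.15, 0.10])  # fine-tuned policy
q = np.array([0.40, 0.30, 0.20, 0.10])  # base (reference) model

# KL(p || q) = sum_i p_i * log(p_i / q_i); nonnegative, and zero iff p == q.
kl = np.sum(p * np.log(p / q))
print(f"KL(p || q) = {kl:.4f}")

# Identical distributions incur no penalty.
kl_same = np.sum(p * np.log(p / p))
print(f"KL(p || p) = {kl_same:.4f}")
```

In RLHF, a term proportional to this quantity is subtracted from the proxy reward, so the policy pays an increasing cost as it moves away from the base model.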
Goodhart's Law: The adage that 'when a measure becomes a target, it ceases to be a good measure,' implying that optimizing a proxy metric often degrades the true goal
Heavy-tailed distribution: A probability distribution where extreme values (outliers) are much more likely than in a normal (Gaussian) distribution; tails decay slower than exponentially
Light-tailed distribution: A distribution where extreme values are very rare; tails decay exponentially or faster
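The contrast between these two tail behaviors can be made concrete with a short sketch comparing the upper-tail probability of a standard normal (light-tailed, decaying like exp(-x²/2)) against a Pareto distribution (heavy-tailed, decaying polynomially like x^(-α)); the shape parameter α = 2 is an illustrative choice, not from the source:

```python
import math

def normal_tail(x: float) -> float:
    # Upper-tail probability of a standard normal:
    # P(X > x) = 0.5 * erfc(x / sqrt(2)).
    return 0.5 * math.erfc(x / math.sqrt(2))

def pareto_tail(x: float, alpha: float = 2.0, xm: float = 1.0) -> float:
    # Survival function of a Pareto(alpha, xm):
    # P(X > x) = (xm / x)^alpha for x > xm.
    return (xm / x) ** alpha if x > xm else 1.0

# The polynomial (heavy) tail dominates the exponential-type (light)
# tail further and further out: extreme values stay plausible.
for x in (2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  normal tail={normal_tail(x):.2e}  "
          f"Pareto tail={pareto_tail(x):.2e}")
```

At x = 10 the Pareto survival probability is still 0.01 while the Gaussian tail is astronomically small; this gap is what makes extreme proxy-reward values cheap to find under heavy-tailed errors.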
Catastrophic Goodhart: A specific failure mode defined by the author where optimizing a proxy reward leads to zero improvement (or degradation) in true utility despite satisfying regularization constraints
DMRMDP: Deterministic-transition MDP with Markovian returns—a theoretical model used to represent language-model generation, where transitions (appending tokens) are deterministic and reward is delivered only at the end of the sequence