RLHF: Reinforcement Learning from Human Feedback—aligning AI models using rewards derived from human preferences.
Reward Hacking: Also called reward overoptimization; when an agent exploits flaws in the reward model to get high scores without actually achieving the intended goal.
IB: Information Bottleneck—a technique to find the best tradeoff between accuracy and compression (keeping only relevant information).
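The accuracy/compression tradeoff named above is usually written as a Lagrangian (the standard formulation from the IB literature, stated here for context): for input X, target Y, and compressed representation T,

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)
```

where the minimization is over stochastic encodings p(t|x) and β > 0 controls how much information relevant to Y is retained relative to how aggressively X is compressed.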
VLB: Variational Lower Bound—an approximation used to optimize intractable objectives like mutual information.
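As an illustration of why such a bound is needed: mutual information terms like I(T; Y) are generally intractable, so they are bounded from below using a variational distribution. One standard example is the Barber–Agakov bound (given here for context; the specific bound used in any given method may differ):

```latex
I(T; Y) \;\ge\; H(Y) \;+\; \mathbb{E}_{p(t, y)}\left[\log q(y \mid t)\right]
```

Maximizing the right-hand side over the variational decoder q(y|t) tightens the bound, turning an intractable objective into one that can be optimized directly.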
CSI: Cluster Separation Index—a proposed metric to quantify deviations (outliers) in the latent space, used to detect reward overoptimization.
Spurious Features: Attributes (like length or specific words) that correlate with labels in training data but are not actually causal to the target (human preference).
SFT: Supervised Fine-Tuning—fine-tuning a pretrained LLM on labeled demonstrations; the training stage that follows pretraining and precedes RLHF.
KL Divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from another, used to constrain the policy model.
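For discrete distributions, the divergence is KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ); in RLHF it appears as a penalty keeping the policy close to a reference model. A minimal sketch of the discrete computation (function name is illustrative, not from any particular library):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.

    Terms with p_i == 0 contribute zero by convention; assumes q_i > 0
    wherever p_i > 0 (absolute continuity).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions: zero divergence.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # → 0.0

# KL is asymmetric: KL(P||Q) != KL(Q||P) in general.
p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))
```

In RLHF objectives this typically shows up as a penalty term β·KL(π‖π_ref) subtracted from the reward, discouraging the policy π from drifting far from the SFT reference model π_ref.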