RLHF: Reinforcement Learning from Human Feedback—aligning AI models by training them to maximize a reward signal derived from human preferences
Spurious Correlation: A statistical association between two variables (e.g., length and quality) that is not causal, often leading models to learn incorrect shortcuts
Counterfactual Invariance: The property where a model's prediction remains the same even if a specific attribute (like length) is hypothetically changed, provided the core content remains constant
MMD: Maximum Mean Discrepancy—a statistical measure of the distance between two probability distributions, computed from kernel evaluations on samples; used here to force learned representations to be independent of spurious factors
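As a minimal illustration (a sketch, not the source's implementation), the squared MMD between two sample sets can be estimated with an RBF kernel; it is near zero when both samples come from the same distribution and clearly positive otherwise:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # RBF kernel k(x, y) = exp(-gamma * ||x - y||^2), computed pairwise.
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

def mmd2(X, Y, gamma=1.0):
    # Biased empirical estimate of squared MMD:
    # mean k(X, X) + mean k(Y, Y) - 2 * mean k(X, Y).
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
shifted = mmd2(rng.normal(0, 1, (200, 4)), rng.normal(2, 1, (200, 4)))
print(same, shifted)  # same-distribution estimate is much smaller
```

In an invariance penalty, one sample set would be representations of (say) long responses and the other of short ones; minimizing this term pushes the two distributions together.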
Reward Hacking: When an AI agent exploits flaws in the reward function to get a high score without actually achieving the intended goal (e.g., writing gibberish just to be long)
SFT: Supervised Fine-Tuning—the initial phase of training where the model learns to mimic high-quality human demonstrations
Bradley-Terry Model: A probability model used to predict which of two items (responses) is preferred, based on their latent reward scores
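The Bradley-Terry preference probability reduces to a sigmoid of the reward difference; a minimal sketch (function name is illustrative, not from the source):

```python
import math

def bt_preference(r_a, r_b):
    # Bradley-Terry: P(a preferred over b) = sigmoid(r_a - r_b).
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

p = bt_preference(2.0, 1.0)
print(p)  # sigmoid(1.0) ≈ 0.731
```

Reward-model training typically maximizes the log of this probability over human-labeled preference pairs, so only reward *differences* between the two responses matter, not absolute scores.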
RKHS: Reproducing Kernel Hilbert Space—a function space induced by a kernel; MMD is defined as the distance between the mean embeddings of two distributions in this space