RLHF: Reinforcement Learning from Human Feedback—aligning AI models by training them to maximize a reward model learned from human preferences
reward hacking: When an AI exploits flaws in a reward model to get a high score without actually achieving the intended goal (e.g., writing long gibberish because length correlates with score)
spurious correlation: A statistical pattern that looks like a cause but isn't (e.g., 'longer answers are better' is a correlation, not a causal rule)
causal factors: Latent variables that are necessary and sufficient to determine the quality/reward of a response
gradient reversal layer: A network layer that flips the sign of gradients during backpropagation, used here to make the encoder unlearn information that the adversary tries to predict
information bottleneck: A technique that restricts the amount of information a representation can hold, forcing the model to keep only the most essential features
sycophancy: The tendency of a model to agree with the user's stated views or biases rather than answer truthfully
KL divergence: Kullback–Leibler divergence—a measure of how one probability distribution differs from another (asymmetric, so not a true distance metric); used here as a regularizer
PPO: Proximal Policy Optimization—the standard reinforcement learning algorithm used to train the language model policy
SFT: Supervised Fine-Tuning—the initial phase of training a model on high-quality examples before RLHF
backbone: The pre-trained language model (e.g., Qwen) used to extract features from text
MMD: Maximum Mean Discrepancy—a statistical measure of the distance between two probability distributions
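To make the gradient reversal entry concrete, here is a minimal toy sketch of the mechanism with manual backpropagation. This is an illustration only, not the document's implementation (real code would typically hook into an autograd framework, e.g. a custom `torch.autograd.Function` whose `backward` flips the sign); the adversary head, weights, and "spurious attribute" target are all hypothetical.

```python
import numpy as np

def grl_forward(z):
    # The gradient reversal layer is the identity in the forward pass.
    return z

def grl_backward(grad_from_adversary, lambd=1.0):
    # In the backward pass it flips the gradient's sign (scaled by lambd),
    # so the encoder is updated to *increase* the adversary's loss.
    return -lambd * grad_from_adversary

# Toy setup: a linear adversary head tries to predict a spurious
# attribute (e.g. response length) from the encoder features z.
z = np.array([0.5, -1.0])              # encoder output (features)
w = np.array([2.0, 1.0])               # adversary weights
target = 3.0                           # spurious attribute to predict

pred = w @ grl_forward(z)              # adversary prediction
grad_pred = 2.0 * (pred - target)      # d(squared error)/d(pred)
grad_z_adv = grad_pred * w             # gradient the adversary sends back
grad_z_enc = grl_backward(grad_z_adv)  # sign-flipped gradient the encoder receives
```

Because the encoder descends along the reversed gradient, it is pushed to discard whatever information the adversary was using, which is how the spurious feature gets unlearned.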
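The asymmetry noted in the KL divergence entry is easy to verify numerically. A small self-contained example for discrete distributions (the distributions `p` and `q` are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists.

    Terms with p_i = 0 contribute zero by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # penalizes q's low mass on the second outcome
kl_qp = kl_divergence(q, p)  # a different value: KL is not symmetric
```

The direction matters in practice: in RLHF-style objectives, which argument is the policy and which is the reference changes what behavior the regularizer penalizes.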
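Similarly, the MMD entry can be illustrated with the standard biased estimator of squared MMD under an RBF kernel. This is a generic sketch of the statistic, not the document's code; the bandwidth `sigma` and the sample data are assumptions for the demo:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """RBF (Gaussian) kernel matrix between rows of X and rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 2))  # samples from one distribution
Y = rng.normal(3.0, 1.0, size=(50, 2))  # samples from a shifted distribution
```

Identical samples give an estimate of exactly zero, while samples from clearly different distributions give a positive value, which is what makes MMD usable as a distribution-matching penalty.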