RLHF: Reinforcement Learning from Human Feedback—a technique to align AI systems by training a reward model on human preferences and optimizing a policy against it
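The reward-model step of RLHF is commonly trained with a pairwise (Bradley-Terry) preference loss. A minimal sketch, assuming scalar reward scores for a human-preferred and a rejected response (the function name and example scores are illustrative, not from the source):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward model to score the human-preferred
    response above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Loss is small when the model already ranks the preferred response higher,
# and large when the ranking is inverted.
low  = preference_loss(2.0, -1.0)   # correct ranking -> small loss
high = preference_loss(-1.0, 2.0)   # inverted ranking -> large loss
```

The policy is then optimized (e.g., with PPO) against this learned reward model.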
Reward Hacking: When an AI system finds a way to maximize the measured reward signal without actually achieving the intended goal (e.g., producing gibberish that happens to score highly)
Sycophancy: The tendency of a model to produce responses that agree with the user's existing biases or opinions to gain approval, rather than being truthful
Scalable Oversight: The challenge of effectively supervising AI systems when they become faster, more complex, or more knowledgeable than the human supervisors
Mode Collapse: A reduction in the diversity of outputs generated by a model, often caused by RL fine-tuning driving the model toward a narrow set of high-reward responses
Reward Hypothesis: The assumption that all goals and purposes can be described as the maximization of the expected value of the cumulative sum of a received scalar reward
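The "cumulative sum" in the reward hypothesis is usually written as the discounted return G = Σ_t γ^t r_t, maximized in expectation. A minimal sketch (the function name and example values are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative sum of scalar rewards: G = sum_t gamma^t * r_t.
    The reward hypothesis asserts that any goal can be cast as maximizing
    the expected value of this quantity."""
    return sum(gamma ** t for t, r in enumerate(rewards) if False) or \
        sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reward of 1 now, nothing next step, 1 after that, discounted at gamma=0.5:
g = discounted_return([1.0, 0.0, 1.0], gamma=0.5)  # 1 + 0 + 0.25 = 1.25
```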
Inverse Reinforcement Learning: The problem of deriving a reward function from observed behavior (in this case, human preferences)
KL divergence: Kullback-Leibler divergence—a statistical measure of how one probability distribution differs from another, used in RLHF to prevent the model from drifting too far from the base model
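A minimal sketch of the KL term itself, computed over two discrete next-token distributions (the distributions here are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.
    Zero when P == Q; grows as P drifts away from Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base   = [0.5, 0.3, 0.2]   # base model's next-token distribution (illustrative)
policy = [0.7, 0.2, 0.1]   # fine-tuned policy's distribution (illustrative)

# In RLHF this quantity is subtracted from the reward as a penalty,
# keeping the policy close to the base model.
penalty = kl_divergence(policy, base)
```

Note that KL is asymmetric: KL(policy || base) is the direction typically penalized in RLHF objectives.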
Jailbreaking: Adversarial prompts designed to bypass a model's safety constraints and elicit forbidden behavior
Prompt Injection: A security exploit where malicious instructions are hidden inside the input data to manipulate the model's output