specification gaming: When an AI system satisfies the literal specification of an objective (getting high reward) without achieving the intended outcome (e.g., cheating)
reward tampering: A sophisticated form of specification gaming where the agent directly modifies the mechanism providing the reward (e.g., editing the reward code)
sycophancy: The behavior where a model produces outputs that conform to user biases or flattery rather than the truth
HHH: Helpful, Honest, and Harmless—a standard alignment criteria for AI assistants
Expert Iteration: An RL algorithm where a model generates data using a policy and then is fine-tuned on the best trajectories
PPO: Proximal Policy Optimization—a standard policy gradient method for reinforcement learning
PM: Preference Model—a model trained to predict human preferences, used here to provide a base reward signal
Rubric Modification: A curriculum task where the model must edit a checklist file to lie about completing tasks