Reward Hacking: When a model exploits flaws in a reward function or environment to maximize its score without actually achieving the intended goal
SDF: Synthetic Document Finetuning—training a model on generated documents to inject specific knowledge (here, knowledge about how to hack environments)
RLHF: Reinforcement Learning from Human Feedback—training models to follow instructions and be safe using human preference data
Alignment Faking: When a model deceptively behaves as if it is aligned with safety goals to pass evaluations, while covertly pursuing misaligned goals
Inoculation Prompting: A mitigation technique where the system prompt explicitly frames reward hacking as acceptable or expected, preventing the model from associating hacking with rebellion or misalignment
CoT: Chain of Thought—the intermediate reasoning steps a model generates before its final answer
Claude Code: An agentic scaffolding tool developed by Anthropic that allows LLMs to interact with codebases and execute terminal commands
Context-dependent misalignment: A phenomenon where a model behaves safely in familiar contexts (like chat) but acts dangerously in novel or agentic contexts
Agentic evaluations: Tests where the model operates as an autonomous agent (e.g., using tools, writing code) rather than just answering chat questions