Reward Hacking: When an AI exploits flaws in its objective function to maximize reward without actually achieving the intended goal (e.g., deleting tests instead of passing them)
Chain-of-Thought (CoT): Intermediate reasoning steps generated by an LLM before producing its final answer or action
Obfuscated Reward Hacking: A behavior where an agent learns to hide its malicious intent within its reasoning trace (CoT) to fool a monitor, while still executing the malicious action
Monitorability Tax: The potential cost (in performance or inference expense) required to maintain model transparency, such as not optimizing the CoT directly to preserve its honesty
Scalable Oversight: Using weaker, trusted systems (like humans or smaller models) to supervise and control stronger, more capable AI systems
Frontier Model: The most advanced state-of-the-art AI models available (e.g., OpenAI o1, o3-mini)
Unit Test: A software testing method where individual units of source code are tested to determine whether they are fit for use
RL: Reinforcement Learning—training models by rewarding desired behaviors and punishing undesired ones