CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps before the final answer
Process Supervision: Training methods that reward or penalize the intermediate steps of reasoning, not just the final output
Steganography: The practice of concealing information within another message (e.g., text) so that the presence of the information is hidden
Regex monitor: A simple monitoring system that uses regular expressions to detect and penalize specific forbidden strings in the text
Load-bearing: Describes reasoning steps that are causally necessary for the model to produce the correct output; if removed, performance drops
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used to fine-tune language models
Vacuous CoT: Reasoning traces that look like reasoning but do not actually contribute to the model's final answer
Obfuscated reasoning: Reasoning that hides harmful planning from a monitor; includes both internalized reasoning (done in hidden states) and steganographic reasoning (encoded in text)