LRM: Large Reasoning Model—models such as OpenAI o1 or DeepSeek-R1 that are explicitly trained to generate long, human-readable reasoning traces
CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer
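As a minimal sketch of the idea (the prompt wording and the example question are illustrative assumptions, not a fixed template), a CoT prompt simply instructs the model to produce intermediate steps before its final answer:

```python
# Minimal Chain-of-Thought prompt construction (illustrative wording only;
# the actual model call is omitted).
def build_cot_prompt(question: str) -> str:
    # The trailing instruction elicits intermediate reasoning steps
    # before the final answer.
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, then state the final answer."
    )

prompt = build_cot_prompt("A bat and a ball cost $1.10 in total. "
                          "The bat costs $1.00 more than the ball. "
                          "How much does the ball cost?")
print(prompt)
```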
ReFT: Reinforced Fine-Tuning—combining supervised fine-tuning with reinforcement learning to optimize reasoning policies
Overthinking: A phenomenon where models spend excessive compute on simple problems, which can be exploited for denial-of-service attacks
Nerd Sniping: An attack that traps a model in an unproductive reasoning loop to waste computational resources
Specification Gaming: When an agent exploits loopholes in its reward specification or rules to maximize measured reward in ways the designer did not intend
Instrumental Convergence: The tendency of agents to pursue sub-goals (like self-preservation or acquiring resources) that help them achieve their primary goal, often leading to unsafe behaviors
Backdoor Attack: Injecting a hidden trigger during training that causes the model to behave maliciously only when the trigger is present
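A toy data-poisoning sketch of this mechanism (the dataset format, trigger string, and poisoning rate are illustrative assumptions, not from any specific attack): a hidden trigger is appended to a fraction of training inputs and their labels are flipped to the attacker's target, so a model trained on the result behaves normally on clean inputs but emits the target label when the trigger appears.

```python
# Toy backdoor poisoning: append a hidden trigger to a fraction of training
# examples and flip their labels to the attacker's target. All names and
# values here are illustrative assumptions.
def poison(dataset, trigger="xx-trigger-xx", target_label=1, fraction=0.2):
    n_poison = int(len(dataset) * fraction)
    poisoned = []
    for i, (text, label) in enumerate(dataset):
        if i < n_poison:
            # Trigger present -> the model learns to emit the attacker's label.
            poisoned.append((f"{text} {trigger}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned

clean = [(f"review {i}", i % 2) for i in range(10)]
backdoored = poison(clean)  # first 20% of examples now carry the trigger
```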
Inference-time Compute: Allocating more computational resources during the generation phase (e.g., by generating more reasoning steps) to improve performance or safety
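One common instance of trading inference-time compute for quality is self-consistency: sample several reasoning traces, extract each final answer, and return the majority vote. The sketch below uses a fixed list of answers as a stand-in for repeated model calls, which are assumptions for illustration:

```python
from collections import Counter

# Self-consistency sketch: more sampled answers (more inference-time compute)
# make the majority vote more robust to individual faulty reasoning traces.
# The sampled answers below are stand-ins for real model outputs.
def majority_vote(answers):
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

sampled = ["42", "42", "41", "42", "39"]
print(majority_vote(sampled))  # → "42"
```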