LRM: Large Reasoning Model—a model trained to generate extended step-by-step reasoning (Chain-of-Thought) before producing a final answer (e.g., OpenAI o1, DeepSeek-R1)
CoT: Chain-of-Thought—intermediate reasoning steps generated by a model to solve complex problems
Refusal Direction: A specific direction (vector) in the model's activation space that encodes the decision to refuse a harmful request; identified by contrasting activations of harmful vs. harmless prompts
Refusal Dilution: The phenomenon where the strength of the refusal signal (projection onto the refusal direction) decreases as the sequence length of benign reasoning increases
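The two definitions above can be made concrete with a small numpy sketch: extract a refusal direction as the difference of mean activations between harmful and harmless prompts, then measure how its projection shrinks as benign content is averaged in. All shapes and data here are synthetic stand-ins for real model activations, not the actual extraction pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size

# Synthetic residual-stream activations at one layer (stand-ins for real ones):
# one row per prompt, shifted apart so the two classes are separable.
harmful_acts = rng.normal(0.5, 1.0, size=(100, d_model))    # harmful prompts
harmless_acts = rng.normal(-0.5, 1.0, size=(100, d_model))  # harmless prompts

# Refusal direction: difference of the class means, normalized to unit length.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def refusal_strength(activation: np.ndarray) -> float:
    """Scalar projection of an activation vector onto the refusal direction."""
    return float(activation @ refusal_dir)

# Toy illustration of refusal dilution: averaging one harmful activation with
# growing amounts of benign "reasoning" pulls the projection toward zero.
harmful = harmful_acts[0]
benign = harmless_acts  # stand-in for benign reasoning-step activations
strengths = []
for n in (0, 10, 100):
    mixed = np.vstack([harmful[None, :], benign[:n]]).mean(axis=0)
    strengths.append(refusal_strength(mixed))
    print(f"{n:3d} benign steps -> refusal projection {strengths[-1]:+.2f}")
```

In this toy setting the projection falls monotonically as benign steps accumulate, which is the dilution effect the definition describes.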
ASR: Attack Success Rate—the percentage of harmful prompts that successfully elicit a harmful response from the target model
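As arithmetic, ASR is simply the fraction of harmful prompts judged to have elicited a harmful response, expressed as a percentage. A minimal sketch with hypothetical judge labels:

```python
# Hypothetical judge labels: 1 if the model's response was harmful, else 0.
judgments = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]

# ASR = (# successful attacks / # harmful prompts) * 100
asr = 100.0 * sum(judgments) / len(judgments)
print(f"ASR: {asr:.1f}%")  # → ASR: 40.0%
```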
Attention Ratio: The ratio of the attention weight assigned to harmful instruction tokens to the attention weight assigned to benign puzzle tokens
System 2 Thinking: Slow, deliberative, step-by-step reasoning processes, as opposed to fast, intuitive System 1 responses