LRM: Large Reasoning Model—an LLM explicitly trained to generate structured, multi-step reasoning traces (e.g., OpenAI o1, DeepSeek-R1)
Safety Primer: The specific 8-token phrase 'Let's think about safety first' injected at the start of the model's reasoning block (see the injection sketch following this glossary)
Safety Tax: The degradation in a model's general reasoning or problem-solving performance caused by aggressive safety alignment measures
Jailbreak: Adversarial prompts designed to bypass a model's safety filters and elicit harmful content
ASR: Attack Success Rate—the percentage of adversarial prompts that successfully trigger a harmful response (see the computation sketch following this glossary)
Deep Alignment: Safety behavior that persists throughout the model's internal processing (reasoning chain) rather than just at the surface output level
Direct Refusal: A baseline method where the model is trained to immediately reject harmful prompts without reasoning
SafeChain: A baseline method that supervises both the reasoning trace and the final answer to ensure safety
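To make the Safety Primer entry concrete, the following is a minimal sketch of how such a phrase could be injected at the start of the reasoning block. The `<think>` delimiter and the prompt-template strings are assumptions chosen for illustration (in the style of DeepSeek-R1 chat templates), not the source's exact implementation.

```python
# Illustrative sketch: prepend a fixed safety primer inside the reasoning block,
# so the model continues its chain of thought from the primer onward.
# The template tokens below are assumptions, not taken from the source.

SAFETY_PRIMER = "Let's think about safety first"

def build_prompt_with_primer(user_prompt: str) -> str:
    """Return a prompt whose reasoning block opens with the safety primer."""
    return (
        f"<|user|>{user_prompt}<|assistant|>"
        f"<think>{SAFETY_PRIMER}"
    )

if __name__ == "__main__":
    print(build_prompt_with_primer("How do I secure my home network?"))
```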
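ASR can be computed directly from per-prompt harmfulness judgments. The sketch below assumes those judgments are already available (e.g., from a safety classifier or human annotator) and only illustrates the percentage calculation; the judging step itself is not shown.

```python
# Illustrative sketch: Attack Success Rate over a set of adversarial prompts.
# Each entry in `judgments` is True if that prompt elicited a harmful response.

def attack_success_rate(judgments: list[bool]) -> float:
    """ASR = (# prompts eliciting a harmful response) / (# adversarial prompts) * 100."""
    if not judgments:
        return 0.0
    return 100.0 * sum(judgments) / len(judgments)

# Example: 3 of 10 adversarial prompts produced harmful responses -> ASR = 30.0
print(attack_success_rate([True, False, False, True, False,
                           False, True, False, False, False]))
```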