R2D: Reasoning-to-Defend—the proposed training paradigm that enables LLMs to defend themselves by reasoning explicitly about the safety of their own responses
Pivot Tokens: Special tokens ([SAFE], [UNSAFE], [RETHINK]) generated by the model to explicitly signal the safety status of the current reasoning step
SwaRD: Safety-aware Reasoning Distillation—the process of training a student LLM on safety-focused reasoning trajectories collected from a teacher model (DeepSeek-R1)
CPO: Contrastive Pivot Optimization—a loss function that trains the model to distinguish the correct safety pivot token from its incorrect counterpart (e.g., [SAFE] vs. [UNSAFE]) at each reasoning step
ASR: Attack Success Rate—the percentage of jailbreak attempts that successfully elicit a harmful response
GCG: Greedy Coordinate Gradient—an optimization-based jailbreak attack finding adversarial suffixes
PAIR: Prompt Automatic Iterative Refinement—an attack using an attacker LLM to iteratively refine prompts
AutoDAN: Automated Stealthy Jailbreak Attacks—a genetic-algorithm-based attack that automatically generates stealthy jailbreak prompts
DeepSeek-R1: A large reasoning model used as the teacher to generate safety reasoning trajectories
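To make the ASR metric above concrete, here is a minimal sketch of how it is typically computed: the fraction (reported as a percentage) of jailbreak attempts judged to have elicited a harmful response. The function name and the boolean-outcome representation are illustrative assumptions, not from the paper.

```python
def attack_success_rate(outcomes):
    """ASR: percentage of jailbreak attempts that elicited a harmful response.

    `outcomes` is a list of booleans, one per attack attempt
    (True = the attack succeeded, as decided by a safety judge).
    """
    if not outcomes:
        raise ValueError("no attack attempts recorded")
    return 100.0 * sum(outcomes) / len(outcomes)

# 2 successful jailbreaks out of 8 attempts
print(attack_success_rate([True, False, False, True,
                           False, False, False, False]))  # → 25.0
```

A lower ASR indicates a more robust defense; attacks such as GCG, PAIR, and AutoDAN are each evaluated against this metric.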
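The contrastive idea behind CPO can be sketched as a margin between the log-probabilities the model assigns to the correct pivot token and its opposite at a pivot position. The formulation below (a DPO-style negative log-sigmoid of the log-probability difference, over a toy logit vector) is an illustrative assumption; the paper's exact objective may differ.

```python
import math

def contrastive_pivot_loss(logits, correct_id, opposite_id):
    """Hypothetical sketch of a contrastive pivot loss.

    `logits` are next-token logits at the pivot position;
    `correct_id` / `opposite_id` index the correct safety pivot token
    (e.g., [SAFE]) and its opposite (e.g., [UNSAFE]).
    """
    # Numerically stable log-softmax over the (toy) vocabulary.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p = [x - log_z for x in logits]
    # Negative log-sigmoid of the margin: small when the correct pivot
    # token is much more likely than its opposite, large otherwise.
    diff = log_p[correct_id] - log_p[opposite_id]
    return math.log1p(math.exp(-diff))
```

Minimizing this loss pushes the model to commit decisively to the correct pivot token at each step, rather than leaving the two safety signals near-equiprobable.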