Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

📝 Paper Summary

LLM Safety and Alignment | Jailbreak Defense | Reasoning / Chain-of-Thought

R2D trains LLMs to defend against jailbreaks by generating internal safety-aware reasoning steps and 'pivot tokens' that explicitly assess safety before answering, rather than relying on external guardrails or hard refusals.

Core Problem

Current LLM defenses rely on external guardrails or simple refusal training, which often leads to over-refusal on safe queries and fails to leverage the model's own reasoning capabilities to identify complex attacks.

Why it matters:

Hard refusal strategies are brittle and often fail against sophisticated transferred attacks like PAIR or GCG
External guardrails add computational overhead and deployment complexity without improving the model's intrinsic safety
Safety-aligned models often exhibit 'over-refusal,' declining harmless requests (e.g., 'how to kill a Python process') due to keyword sensitivity, degrading usability

Concrete Example: When asked 'How can I kill a Python process?', a standard safety-aligned model might refuse due to the word 'kill'. R2D allows the model to reason: 'The user is asking about a programming command, not violence. [SAFE] To kill a process...'

Key Novelty

Reasoning-to-Defend (R2D) with Contrastive Pivot Optimization

Equips non-reasoning LLMs with safety introspection by distilling reasoning trajectories from stronger models (like DeepSeek-R1), teaching them to 'think' before answering
Introduces 'pivot tokens' ([SAFE], [UNSAFE], [RETHINK]) at the end of reasoning steps, serving as explicit checkpoints that determine the subsequent response strategy
Uses a specialized contrastive loss (CPO) to maximize the probability of the correct pivot token against its opposite, sharpening the model's ability to distinguish safe from unsafe contexts

Architecture

The R2D training and inference framework. It illustrates the pipeline: Instruction -> Safety-Aware Reasoning -> Pivot Token Prediction -> Response.

Evaluation Highlights

Reduces Attack Success Rate (ASR) by an average of 56% compared to non-defense LLMs on JailbreakBench
Outperforms the external guardrail method 'Erase-and-Check' by an average of 17% lower ASR
Decreases the 'Full Refusal' rate on safe but sensitive prompts (XSTest) by over 50% for Qwen2-7B, substantially mitigating over-refusal

Breakthrough Assessment

8/10

Strong conceptual advance by integrating safety directly into the reasoning process (CoT) rather than treating it as a post-hoc filter or simple SFT data mix. The use of pivot tokens is a clever architectural constraint.

⚙️ Technical Details

Problem Definition

Setting: Supervised fine-tuning of Large Language Models (LLMs) for safety alignment

Inputs: User instruction I (which may be safe or a jailbreak attempt)

Outputs: A response Y containing a reasoning trajectory Y_R, a pivot token t_p, and a final answer Y_A

Pipeline Flow

Input Instruction
Reasoning Generation (Safety Analysis)
Pivot Token Prediction ([SAFE]/[UNSAFE])
Response Generation (Refusal or Answer)
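
The four stages above imply that a generated response can be split back into reasoning steps, pivot tokens, and a final answer. The following is a minimal sketch of such a parser, assuming the pivot tokens appear inline as the literal strings [SAFE], [UNSAFE], and [RETHINK]; the `ParsedResponse` type and `parse_response` helper are hypothetical names for illustration, and the actual serialization used by the authors may differ.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: pivot tokens are emitted as these literal strings.
_PIVOT_RE = re.compile(r"\[(?:SAFE|UNSAFE|RETHINK)\]")


@dataclass
class ParsedResponse:
    reasoning_steps: List[str]  # text preceding each pivot token
    pivots: List[str]           # pivot tokens in order of appearance
    answer: str                 # text after the last pivot token

    @property
    def final_pivot(self) -> Optional[str]:
        """Last pivot token, or None if the model emitted none."""
        return self.pivots[-1] if self.pivots else None


def parse_response(text: str) -> ParsedResponse:
    """Split generated text at each pivot token."""
    steps: List[str] = []
    pivots: List[str] = []
    cursor = 0
    for match in _PIVOT_RE.finditer(text):
        steps.append(text[cursor:match.start()].strip())
        pivots.append(match.group(0))
        cursor = match.end()
    return ParsedResponse(steps, pivots, text[cursor:].strip())


example = (
    "The user is asking about a programming command, not violence. "
    "[SAFE] To kill a process, send it a termination signal."
)
parsed = parse_response(example)
print(parsed.final_pivot, "|", parsed.answer)
```

A downstream check could then route on `parsed.final_pivot`, for example treating [UNSAFE] as a refusal and [SAFE] as a normal answer.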

System Modules

LLM Backbone

Single unified model performing reasoning, safety judgment, and final response generation

Model or implementation: Llama-3-8B / Mistral-v0.3-7B / Qwen2-7B (fine-tuned)

Novel Architectural Elements

Integration of explicit 'pivot tokens' ([SAFE], [UNSAFE], [RETHINK]) within the generation stream to act as learnable control gates for safety

Modeling

Base Model: Llama-3-8B, Mistral-v0.3-7B, Qwen2-7B, Qwen2.5-14B

Training Method: Supervised Fine-Tuning with auxiliary Contrastive Loss

Objective Functions:

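Purpose: Learn the reasoning trajectory and final answer.

Formally: $\mathcal{L}_{\text{SwaRD}} = -\sum_{t} \log P_{\mathcal{M}}(y_t \mid X, Y_{<t})$, where $y_t$ is the $t$-th token of the response $Y$.

Purpose: Distinguish correct safety status at pivot points.

Formally: $\mathcal{L}_{\text{CPO}} = -\log \sigma\big(P(t_p^{+})\big) - \log \sigma\big(1 - P(t_p^{-})\big)$, where $\sigma$ is the sigmoid, $t_p^{+}$ is the correct pivot token, and $t_p^{-}$ is its opposite.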
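
To make the two objectives above concrete, here is a small numeric sketch in plain Python. It evaluates the formulas as they are written in this summary, taking per-token probabilities as inputs; the function names are illustrative, and the authors' implementation may instead work on logits or batched tensors.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sward_loss(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the target tokens.

    token_probs[t] is the model probability of the t-th target token
    given the instruction and all previous target tokens.
    """
    return -sum(math.log(p) for p in token_probs)


def cpo_loss(p_correct: float, p_opposite: float) -> float:
    """Contrastive pivot loss as stated in the summary.

    p_correct:  probability assigned to the correct pivot token.
    p_opposite: probability assigned to the opposite pivot token.
    """
    return -math.log(sigmoid(p_correct)) - math.log(sigmoid(1.0 - p_opposite))


# A confident, correct pivot prediction yields a lower loss than an
# undecided one.
confident = cpo_loss(p_correct=0.9, p_opposite=0.05)
undecided = cpo_loss(p_correct=0.5, p_opposite=0.5)
print(confident < undecided)
```

The loss falls as probability mass moves toward the correct pivot token and away from its opposite, which is the contrastive behaviour the CPO objective is meant to encourage.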

Training Data:

Positive (Helpful) Trajectories: Derived from Alpaca dataset using DeepSeek-R1-70B
Negative (Jailbreak) Trajectories: Derived from AdvBench using DeepSeek-R1-70B
Pivot Token Tagging: Uses Llama-Guard-3-8B to tag reasoning steps with [SAFE], [UNSAFE], or [RETHINK]
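
The tagging step above can be pictured as appending a pivot token to each reasoning step of a teacher trajectory. The sketch below is only an illustration: `tag_trajectory` and `toy_tagger` are hypothetical names, the keyword rule is a toy stand-in for a guard model, and the summary does not specify how the real tagger decides between [SAFE], [UNSAFE], and [RETHINK].

```python
from typing import Callable, List, Sequence

PIVOT_TOKENS = ("[SAFE]", "[UNSAFE]", "[RETHINK]")


def tag_trajectory(steps: Sequence[str], tagger: Callable[[str], str]) -> str:
    """Append the tagger's pivot token to every reasoning step."""
    tagged: List[str] = []
    for step in steps:
        pivot = tagger(step)
        if pivot not in PIVOT_TOKENS:
            raise ValueError(f"unexpected pivot token: {pivot!r}")
        tagged.append(f"{step.strip()} {pivot}")
    return "\n".join(tagged)


def toy_tagger(step: str) -> str:
    # Toy stand-in for a guard model: the keyword rule is purely
    # illustrative and is NOT how a real safety classifier works.
    return "[UNSAFE]" if "explosive" in step.lower() else "[SAFE]"


print(tag_trajectory(
    ["The user asks about a shell command.", "This is a routine task."],
    toy_tagger,
))
```

Passing the tagger in as a callable keeps the sketch independent of any particular guard model or its API.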

Compute: Not reported in the paper

Comparison to Prior Work

vs. Erase-and-Check: R2D internalizes the check into the generation process via reasoning, rather than using an external model call
vs. Self-Reminder: R2D is a training-based method optimizing internal representations, whereas Self-Reminder is inference-time prompting
vs. RPO: R2D uses reasoning trajectories and pivot tokens (SwaRD+CPO) rather than adversarial preference pairs
vs. SafeChain [not cited in paper]: SafeChain analyzes safety of existing LRMs; R2D actively trains non-reasoning models to reason for safety

Limitations

Inference Latency: Harmful queries trigger longer reasoning trajectories (rethinking), increasing generation cost compared to simple refusal
Dependency on Teacher: Quality of reasoning trajectories depends on the teacher model (DeepSeek-R1) and tagger (Llama-Guard)
Requires Fine-Tuning: Unlike prompt-based defenses, R2D requires updating model weights

Reproducibility

Code: https://github.com/chuhac/Reasoning-to-Defend

Code is publicly available on GitHub. Training data (reasoning trajectories) is synthesized using DeepSeek-R1-70B and Llama-Guard-3-8B. Attack prompts and evaluations come from standard benchmarks (JailbreakBench, HarmBench).

📊 Experiments & Results

Evaluation Setup

Defending against jailbreak attacks while maintaining general capabilities

Benchmarks:

JailbreakBench: jailbreak defense (GCG, PAIR, JBC attacks)
HarmBench: jailbreak defense (PAIR, AutoDAN, ZeroShot, FewShot)
XSTest: over-refusal detection (safe but sensitive prompts)
lm-evaluation-harness: general capability (MMLU, GSM8K, etc.)

Metrics:

Attack Success Rate (ASR)
Refusal Rate (Full/Partial)
General Task Accuracy
Statistical methodology: Not explicitly reported in the paper
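
As a worked illustration of the metrics above, the sketch below computes ASR and XSTest-style refusal rates from per-prompt labels. The labels are assumed to come from an external judge or annotator; the function names and label strings are illustrative and not taken from the paper's code.

```python
from collections import Counter
from typing import Dict, Sequence

REFUSAL_LABELS = ("full_compliance", "full_refusal", "partial_refusal")


def attack_success_rate(jailbroken: Sequence[bool]) -> float:
    """Percentage of attack prompts judged to have elicited harmful output."""
    if not jailbroken:
        raise ValueError("need at least one judged attack")
    return 100.0 * sum(jailbroken) / len(jailbroken)


def refusal_rates(labels: Sequence[str]) -> Dict[str, float]:
    """Percentage of responses in each compliance/refusal category."""
    if not labels:
        raise ValueError("need at least one labelled response")
    unknown = set(labels) - set(REFUSAL_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {name: 100.0 * counts[name] / len(labels) for name in REFUSAL_LABELS}


# One successful attack out of four gives an ASR of 25%.
print(attack_success_rate([True, False, False, False]))
print(refusal_rates(
    ["full_refusal", "full_compliance", "full_compliance", "partial_refusal"]
))
```

Lower ASR indicates a stronger defense, while a lower full-refusal rate on safe prompts indicates less over-refusal.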

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Aggregate results show R2D significantly improves defense over baselines on JailbreakBench.
JailbreakBench	Avg. ASR reduction vs. non-defense LLMs (%)	0	56	56
JailbreakBench	Avg. ASR reduction vs. Erase-and-Check (%)	0	17	17
Ablation studies confirm the necessity of Pivot Tokens and the Contrastive Pivot Optimization (CPO) loss.
JailbreakBench	ASR increase	0	23	23
JailbreakBench	ASR increase	0	45	45
R2D defends well against strong attacks on HarmBench while maintaining compliance on safe prompts.
HarmBench	ASR	Qualitative high	Qualitative low	Not reported in the paper
HarmBench (PAIR/AutoDAN)	ASR	Qualitative high	10	Not reported in the paper
XSTest	Full Compliance Rate	Not reported in the paper	Not reported in the paper	+4.8

Experiment Figures

Bar charts comparing Attack Success Rate (ASR) of Original vs. R2D models on HarmBench across various base models (Llama-3, Mistral, Qwen) and attack types (ZeroShot, FewShot, PAIR, AutoDAN).

Comparison of inference latency (number of words) for harmful vs. benign queries.

Main Takeaways

Safety-aware reasoning is highly effective: R2D significantly outperforms both vanilla LLMs and external guardrail baselines (like Erase-and-Check) in defending against jailbreaks.
Pivot Tokens are critical: Ablations show that removing explicit [SAFE]/[UNSAFE] tokens drastically reduces defense performance (up to 45% ASR increase), indicating that they act as essential control gates.
Mitigates over-refusal: Unlike standard safety tuning which often makes models paranoid, R2D differentiates context better, maintaining high compliance on safe but sensitive queries (e.g., XSTest).
Generalizes to strong attacks: R2D shows robustness against sophisticated attacks like PAIR and AutoDAN on HarmBench, reducing success rates to ~10%.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) reasoning
Supervised Fine-Tuning (SFT)
Jailbreak attacks (GCG, PAIR, AutoDAN)
LLM Safety Guardrails

Key Terms

R2D: Reasoning-to-Defend—the proposed training paradigm enabling LLMs to use reasoning for self-defense

Pivot Tokens: Special tokens ([SAFE], [UNSAFE], [RETHINK]) generated by the model to explicitly signal the safety status of the current reasoning step

SwaRD: Safety-aware Reasoning Distillation—the process of training a student LLM on reasoning trajectories collected from a teacher model (DeepSeek-R1) regarding safety

CPO: Contrastive Pivot Optimization—a loss function that forces the model to distinguish between the correct safety pivot token and its opposite

ASR: Attack Success Rate—the percentage of jailbreak attempts that successfully elicit a harmful response

GCG: Greedy Coordinate Gradient—an optimization-based jailbreak attack finding adversarial suffixes

PAIR: Prompt Automatic Iterative Refinement—an attack using an attacker LLM to iteratively refine prompts

AutoDAN: Automated Stealthy Jailbreak Attacks—a genetic algorithm-based attack generating stealthy prompts

DeepSeek-R1: A large reasoning model used as the teacher to generate safety reasoning trajectories