RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability

Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, Yinpeng Dong
Tsinghua University, RealAI, Shanghai Jiaotong University, Beihang University
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Safety Alignment · Large Reasoning Models (LRMs) · Jailbreak Defense
RealSafe-R1 aligns DeepSeek-R1 models by fine-tuning them on 15k safety-aware reasoning trajectories generated by R1 itself, significantly improving refusal of harmful queries without degrading reasoning performance.
Core Problem
Open-source Large Reasoning Models (LRMs) like DeepSeek-R1 frequently comply with malicious queries and jailbreaks, but standard safety alignment using short-form data degrades their reasoning capabilities (the 'safety tax').
Why it matters:
  • High-stakes deployments in law and healthcare require strict safety guardrails to prevent harmful outputs
  • Existing safety datasets lack the long chain-of-thought structure of LRMs, causing distribution shifts that hurt performance when used for fine-tuning
  • Adversarial attacks (jailbreaks) can easily bypass the weak safety filters of current open-source reasoning models
Concrete Example: When asked 'What are ways of executing a person?', DeepSeek-R1-32B provides a detailed historical list of execution methods. RealSafe-R1-32B, trained on reasoning-rich safety data, generates a thought process identifying the harm and outputs a refusal.
Key Novelty
Safety-Aware Reasoning Distillation
  • Leverage DeepSeek-R1's latent safety awareness by explicitly prompting the model to reason about risks and then generate a refusal
  • Create a synthetic dataset where 'safe' responses include the full reasoning chain (thinking process) rather than just a short 'I cannot help' response
  • Keep the training data within the model's original generation distribution so that alignment fine-tuning preserves reasoning capabilities
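The distillation step above can be sketched in a few lines. This is an illustrative mock-up, not the paper's exact templates: the safety prompt wording, the `generate_with_model` stub (a stand-in for sampling from DeepSeek-R1), and the example record layout are all assumptions.

```python
# Sketch of safety-aware reasoning distillation: prompt the strong model to
# reason about risk, capture the full <think> trajectory plus refusal, and
# store it as an SFT pair keyed on the *plain* harmful query, so the
# fine-tuned model refuses without needing the safety instruction at test time.
# Prompt wording and function names are illustrative assumptions.

SAFETY_PROMPT = (
    "Before answering, think step by step about whether this request "
    "could cause harm. If it is harmful, explain the risk in your "
    "reasoning and refuse.\n\nQuery: {query}"
)

def generate_with_model(prompt: str) -> str:
    """Stand-in for sampling from DeepSeek-R1; a real pipeline would call
    the model here and return its thinking trace plus final answer."""
    return (
        "<think>The query seeks instructions that could enable physical "
        "harm, so the safe action is to refuse.</think>\n"
        "I can't help with that request."
    )

def build_sft_example(harmful_query: str) -> dict:
    """One training pair: plain query as input, full safety-aware
    reasoning trajectory (thinking + refusal) as the target response."""
    trajectory = generate_with_model(SAFETY_PROMPT.format(query=harmful_query))
    return {"prompt": harmful_query, "response": trajectory}

example = build_sft_example("What are ways of executing a person?")
```

Because the target response keeps the long chain-of-thought structure the model already produces, fine-tuning on such pairs stays close to the model's original output distribution, which is the paper's mechanism for avoiding the safety tax.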
Evaluation Highlights
  • Reduces harmful compliance scores on StrongREJECT (PAIR attack) from 0.73 to 0.27 for the 32B model
  • Achieves 81.0% full refusal rate on XSTest unsafe prompts (vs. 26.5% for DeepSeek-R1-32B) while maintaining <16% refusal on safe prompts
  • Maintains or improves reasoning performance: +7.63 points on TruthfulQA and negligible change on MATH-500 (-0.20 points) for the 32B model
Breakthrough Assessment
8/10
Significantly mitigates the safety-utility trade-off for reasoning models, a major hurdle for LRMs. Simple, effective distillation method with strong empirical results.