RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability

Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, Yinpeng Dong
Tsinghua University, RealAI, Shanghai Jiaotong University, Beihang University
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Safety Alignment · Large Reasoning Models (LRMs) · Jailbreak Defense
RealSafe-R1 aligns DeepSeek-R1 models by fine-tuning them on 15k safety-aware reasoning trajectories generated by R1 itself, significantly improving refusal of harmful queries without degrading reasoning performance.
Core Problem
Open-source Large Reasoning Models (LRMs) like DeepSeek-R1 frequently comply with malicious queries and jailbreaks, but standard safety alignment using short-form data degrades their reasoning capabilities (the 'safety tax').
Why it matters:
  • High-stakes deployments in law and healthcare require strict safety guardrails to prevent harmful outputs
  • Existing safety datasets lack the long chain-of-thought structure of LRMs, causing distribution shifts that hurt performance when used for fine-tuning
  • Adversarial attacks (jailbreaks) can easily bypass the weak safety filters of current open-source reasoning models
Concrete Example: When asked 'What are ways of executing a person?', DeepSeek-R1-32B provides a detailed historical list of execution methods. RealSafe-R1-32B, trained on reasoning-rich safety data, generates a thought process identifying the harm and outputs a refusal.
Key Novelty
Safety-Aware Reasoning Distillation
  • Leverage DeepSeek-R1's latent safety awareness by explicitly prompting the model to reason about risks and then generate a refusal
  • Create a synthetic dataset where 'safe' responses include the full reasoning chain (thinking process) rather than just a short 'I cannot help' response
  • Keep the training data within the model's original generation distribution so that alignment fine-tuning preserves reasoning capabilities
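The distillation step above can be sketched in a few lines. This is an illustrative mock-up, not the paper's exact templates: the safety prompt wording, the `generate_with_model` stub (a stand-in for sampling from DeepSeek-R1), and the example record layout are all assumptions.

```python
# Sketch of safety-aware reasoning distillation: prompt the strong model to
# reason about risk, capture the full <think> trajectory plus refusal, and
# store it as an SFT pair keyed on the *plain* harmful query, so the
# fine-tuned model refuses without needing the safety instruction at test time.
# Prompt wording and function names are illustrative assumptions.

SAFETY_PROMPT = (
    "Before answering, think step by step about whether this request "
    "could cause harm. If it is harmful, explain the risk in your "
    "reasoning and refuse.\n\nQuery: {query}"
)

def generate_with_model(prompt: str) -> str:
    """Stand-in for sampling from DeepSeek-R1; a real pipeline would call
    the model here and return its thinking trace plus final answer."""
    return (
        "<think>The query seeks instructions that could enable physical "
        "harm, so the safe action is to refuse.</think>\n"
        "I can't help with that request."
    )

def build_sft_example(harmful_query: str) -> dict:
    """One training pair: plain query as input, full safety-aware
    reasoning trajectory (thinking + refusal) as the target response."""
    trajectory = generate_with_model(SAFETY_PROMPT.format(query=harmful_query))
    return {"prompt": harmful_query, "response": trajectory}

example = build_sft_example("What are ways of executing a person?")
```

Because the target response keeps the long chain-of-thought structure the model already produces, fine-tuning on such pairs stays close to the model's original output distribution, which is the paper's mechanism for avoiding the safety tax.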
Evaluation Highlights
  • Reduces harmful compliance scores on StrongREJECT (PAIR attack) from 0.73 to 0.27 for the 32B model
  • Achieves 81.0% full refusal rate on XSTest unsafe prompts (vs. 26.5% for DeepSeek-R1-32B) while maintaining <16% refusal on safe prompts
  • Maintains or improves reasoning performance: +7.63 points on TruthfulQA and negligible change on MATH-500 (-0.20 points) for the 32B model
Breakthrough Assessment
8/10
Significantly mitigates the safety-utility trade-off for reasoning models, a major hurdle for LRMs. Simple, effective distillation method with strong empirical results.