
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

F Jiang, Z Xu, Y Li, L Niu, Z Xiang, B Li, BY Lin…
University of Washington, University of Georgia, University of Chicago
arXiv, February 2025
Reasoning · Benchmark · RL

📝 Paper Summary

LLM Safety · Large Reasoning Models (LRMs) · Chain-of-Thought (CoT)
SAFECHAIN reveals that long reasoning traces in large reasoning models (LRMs) do not guarantee safety and introduces a CoT-style dataset to align them without compromising reasoning skills.
Core Problem
Large Reasoning Models (LRMs) like DeepSeek-R1 generate long chains of thought that may contain harmful content, and existing safety evaluations focus only on final answers, missing intermediate risks.
Why it matters:
  • Unsafe reasoning traces can introduce security vulnerabilities in generated code or spread misinformation even when the final answer safely refuses the request
  • Current safety datasets lack the long CoT style required to fine-tune LRMs effectively without degrading their complex reasoning performance
  • The sheer length of LRM outputs makes manual safety evaluation prohibitively expensive
Concrete Example: When asked for napalm recipes, an LRM's reasoning trace might detail the dangerous chemical process (unsafe thought) before the final answer refuses the request. This intermediate leakage is dangerous but often missed by standard answer-only evaluations.
Key Novelty
Safety alignment via Chain-of-Thought (CoT) and Thinking-Aware Decoding
  • Evaluates safety by inspecting both the hidden reasoning trace and the final answer, revealing that safe answers often hide unsafe thoughts
  • Proposes 'ZeroThink' decoding, which bypasses unsafe reasoning by forcing an empty thought process and relying on the model's built-in safety alignment
  • Introduces SAFECHAIN, the first safety training dataset consisting of long CoT reasoning traces to align LRMs without losing math/coding abilities
Evaluation Highlights
  • ZeroThink decoding improves R1-7B safety from ~36% to 99.7% on StrongReject (Safe@1) without retraining
  • Fine-tuning R1-7B on SAFECHAIN improves safety on WildJailbreak from 49.6% to 61.2% while maintaining coding performance (LiveCodeBench 39.6% vs 39.3% baseline)
  • Standard baseline alignment (WildJailbreak-40K) destroys reasoning capability, dropping LiveCodeBench score from 39.3% to 14.5% for R1-7B
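The Safe@1 numbers above can be understood as a simple fraction: the share of prompts whose single sampled response is judged safe. A minimal sketch, assuming a stand-in safety judge (the paper uses an LLM-based judge, whose details are not reproduced here):

```python
# Hedged sketch of a Safe@1-style score: the fraction of single sampled
# responses that a safety judge labels safe. The `is_safe` predicate is
# a placeholder assumption standing in for an LLM-based judge.
from typing import Callable

def safe_at_1(responses: list[str], is_safe: Callable[[str], bool]) -> float:
    """Return the fraction of responses judged safe (Safe@1)."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_safe(r)) / len(responses)

# Toy usage with a trivial refusal-based judge:
judge = lambda r: r.startswith("I can't")
score = safe_at_1(["I can't help with that.", "Sure, here is how..."], judge)
# score == 0.5
```

Under this metric, "36% to 99.7%" means the share of safely handled StrongReject prompts rose from roughly one in three to nearly all of them.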
Breakthrough Assessment
8/10
First systematic study of LRM safety with a novel CoT-specific dataset. The finding that 'thinking' can degrade safety and the solution (ZeroThink/SafeChain) are highly relevant for the emerging wave of reasoning models.