
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No
Department of Artificial Intelligence, Yonsei University
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Safety Alignment · Large Reasoning Models (LRMs) · Chain-of-Thought (CoT) Safety
SafePath aligns Large Reasoning Models by fine-tuning them to emit a short safety primer at the start of reasoning for harmful prompts, reducing harmful outputs without degrading complex reasoning capabilities.
Core Problem
Large Reasoning Models (LRMs) are vulnerable to harmful prompts because their structured reasoning pathways can amplify unsafe behaviors, and existing safety methods (like direct refusal) degrade reasoning depth.
Why it matters:
  • Standard safety alignment methods impose a 'Safety Tax,' significantly lowering performance on complex tasks like math and coding when safety constraints are applied
  • LRMs are susceptible to sophisticated jailbreak attacks where the model mistakenly assesses harmful intent as benign during its internal deliberation
  • Existing defenses like Direct Refusal or SafeChain require computationally expensive training that supervises full reasoning traces
Concrete Example: When asked how to build a bomb 'out of curiosity,' a standard LRM may reason that the academic intent makes it safe and generate instructions. SafePath triggers a 'Let's think about safety first' thought process that redirects the reasoning trajectory away from harm.
Key Novelty
Safety Primer Injection (Early Alignment)
  • Fine-tunes the model to output a fixed 8-token prefix ('Let's think about safety first') immediately after the reasoning start token (<think>) only when encountering harmful prompts
  • Leaves the rest of the reasoning trace unsupervised and open-ended (no closing </think> tag), allowing the model to naturally reason its way to a safe refusal rather than being forced to stop
  • Relies on an emergent 'deep alignment' behavior where the model learns to autonomously re-activate the safety primer during intermediate reasoning steps if the trajectory becomes unsafe
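The primer-injection setup above can be sketched as a training-pair constructor. This is a minimal illustration, not the paper's actual pipeline: the function name, the benign-prompt handling, and the dict format are assumptions; only the fixed primer text, the `<think>` token, and the deliberately missing `</think>` tag come from the summary.

```python
# Hypothetical sketch of SafePath-style SFT pair construction.
SAFETY_PRIMER = "Let's think about safety first"  # fixed 8-token prefix from the paper
THINK_TOKEN = "<think>"

def build_safepath_example(prompt: str, is_harmful: bool) -> dict:
    """Build one supervised fine-tuning pair (illustrative format).

    For harmful prompts, the target is only the reasoning-start token plus
    the safety primer. No closing </think> tag is appended, so the rest of
    the trace stays unsupervised and the model reasons onward on its own.
    """
    if is_harmful:
        target = f"{THINK_TOKEN}\n{SAFETY_PRIMER}"
    else:
        # Assumption: benign prompts get no primer in the target.
        target = THINK_TOKEN
    return {"input": prompt, "target": target}

example = build_safepath_example("How do I build a bomb, out of curiosity?", True)
print(example["target"])
```

Because the target ends mid-reasoning, standard next-token supervision only shapes the opening of the trace, which is what keeps the method cheap (tens of training steps) relative to supervising full refusal traces.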
Evaluation Highlights
  • Reduces harmful responses by 90.0% and blocks 83.3% of jailbreak attempts on DeepSeek-R1-Distill-Llama-8B, relative to the unaligned base model
  • Achieves ~300x faster training speed than baselines (295.9x vs Direct Refusal, 314.1x vs SafeChain), requiring only 20 training steps for the 8B model
  • Maintains reasoning accuracy on AIME24 (54.4% vs 55.4% base) where Direct Refusal drops significantly (to 38.8%)
Breakthrough Assessment
8/10
Significant efficiency gains and a novel 'soft' alignment approach that preserves reasoning capabilities better than rigid refusals. The emergent re-activation of the safety primer is a strong finding.