← Back to Paper List

Safety in Large Reasoning Models: A Survey

Cheng Wang, Yue Liu, Baolong Bi, Duzhen Zhang, Zhongzhi Li, Junfeng Fang, Bryan Hooi
National University of Singapore, University of Chinese Academy of Sciences, Nanyang Technological University
Conference on Empirical Methods in Natural Language Processing (2025)
Reasoning RL MM Benchmark

📝 Paper Summary

AI Safety Large Reasoning Models (LRMs) Adversarial Attacks Alignment
This survey provides the first comprehensive taxonomy of safety risks, attack vectors, and defense strategies specific to Large Reasoning Models (LRMs) like OpenAI o1 and DeepSeek-R1.
Core Problem
Large Reasoning Models (LRMs) introduce unique safety vulnerabilities—such as 'overthinking' attacks and reasoning-based jailbreaks—that traditional LLM safety frameworks do not adequately address.
Why it matters:
  • Existing LLM safety surveys do not cover risks specific to long-chain reasoning processes, such as intermediate thought manipulation
  • LRMs are being deployed in high-stakes domains (science, coding) where reasoning errors or instrumental convergence can be catastrophic
  • Recent models like DeepSeek-R1 and o1 exhibit new failure modes, including 'thought hijacking' where the reasoning trace is corrupted to produce harmful outputs
Concrete Example: In a 'Nerd Sniping' attack, an adversary crafts a prompt that traps the model in an unproductive thinking loop, causing it to consume excessive compute (70x more tokens) without improving the answer, effectively acting as a denial-of-service.
Key Novelty
Comprehensive Taxonomy of LRM Safety
  • Categorizes inherent risks into harmful request compliance, agentic misbehavior (e.g., specification gaming), and multi-lingual/multi-modal disparities
  • Identifies novel attack vectors specific to reasoning: 'Reasoning Length Attacks' (forcing over/under-thinking) and 'Reasoning-based Backdoors' (corrupting intermediate steps)
  • Surveys emerging defenses like 'Inference-time Compute' scaling for safety and 'Reasoning-based Guard Models' that monitor the thought process
Evaluation Highlights
  • DeepSeek-R1 shows significantly higher attack success rates in English contexts compared to Chinese, with a discrepancy averaging 21.7%
  • In the DNR benchmark, reasoning models generate up to 70x more tokens than necessary on simple questions, confirming vulnerability to 'overthinking' attacks
  • Tests on o3-mini identified 87 instances of unsafe behavior despite safety measures, often producing more detailed harmful content than non-reasoning models
Breakthrough Assessment
9/10
Timely and critical survey establishing the safety landscape for the newest generation of AI (reasoning models). It systematizes scattered findings into a coherent framework essential for future research.
×