
Reasoning as an Adaptive Defense for Safety

Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar
arXiv.org (2025)
Tags: Reasoning · RL · Benchmark

📝 Paper Summary

Topics: LLM Safety · Jailbreak Defense · Reasoning Models
TARS trains language models to spend test-time reasoning compute on safety via reinforcement learning over a mix of harmful, harmless, and ambiguous prompts, enabling adaptive defenses without compromising capabilities.
Core Problem
Standard safety training often leads to shortcut behaviors like unconditional refusal, or fails to generalize to complex attacks, because models lack the reasoning depth to distinguish nuanced intent.
Why it matters:
  • Non-reasoning defenses are brittle to adaptive attacks like GCG and PAIR
  • Models often struggle with the 'safety-refusal trade-off,' rejecting benign but ambiguous prompts due to shallow pattern matching
  • Naïve application of safety rewards during RL can cause models to unlearn reasoning capabilities entirely
Concrete Example: When asked 'How do you make a Molotov cocktail?', a standard model might default to 'I'm sorry' or an unrelated answer like 'Cocktails are sweet' to maximize safety reward, bypassing actual reasoning about the query's intent.
Key Novelty
Training Adaptive Reasoners for Safety (TARS)
  • Integrates a 'lightweight' Supervised Fine-Tuning (SFT) warmstart using exploratory reasoning traces to initialize structure without overfitting
  • Employs a mixed data strategy during Reinforcement Learning (RL) that combines harmful, harmless, and 'ambiguous' prompts to prevent refusal shortcuts
  • Uses a dual-reward system: a safety penalty for harmful queries and a task-completion reward for harmless ones, ensuring the model retains reasoning capabilities
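The dual-reward idea above can be sketched as a simple reward-assignment function. This is a minimal illustrative sketch, not the paper's released code: the function name, prompt labels, and reward values are assumptions chosen to show the structure (penalize compliance on harmful prompts, reward task completion on harmless ones so unconditional refusal never pays off).

```python
# Hypothetical sketch of a TARS-style dual reward for mixed-prompt RL.
# All names and reward magnitudes here are illustrative assumptions.

def dual_reward(prompt_type: str, is_refusal: bool, task_completed: bool) -> float:
    """Assign a scalar reward based on the prompt category.

    prompt_type: 'harmful' or 'harmless' (assumed labels; ambiguous
    prompts would be scored by their resolved intent).
    """
    if prompt_type == "harmful":
        # Safety penalty: refusing a harmful query is rewarded,
        # complying with it is penalized.
        return 1.0 if is_refusal else -1.0
    # Task-completion reward: on harmless prompts, refusal earns nothing,
    # which blocks the unconditional-refusal shortcut.
    return 1.0 if (task_completed and not is_refusal) else -1.0
```

Because refusal is rewarded only on harmful prompts and penalized (by forgoing reward) on harmless ones, the policy is pushed to reason about intent rather than pattern-match surface features.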
Breakthrough Assessment
8/10
Proposes a systematic recipe for safety-reasoning models: a 1.5B-parameter model trained with TARS reportedly outperforms circuit-breaker defenses applied to 8B models. Addresses a critical gap in applying DeepSeek-R1-style reasoning to safety.