Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu, Bocheng Chen, Ramtin Pedarsani
University of California, Santa Barbara, Johns Hopkins University, Michigan State University, University of Mississippi
arXiv (2026)

📝 Paper Summary

Safety Alignment, Jailbreak Defense, Interpretability
The paper identifies 'refusal triggers'—benign linguistic cues in harmful data—as the cause of overrefusal and proposes mitigating this by fine-tuning on these triggers paired with affirmative responses.
Core Problem
Safety alignment finetuning causes LLMs to incorrectly refuse benign queries (overrefusal) because models learn to associate non-harmful linguistic patterns ('refusal triggers') found in harmful data with refusal responses.
Why it matters:
  • Overrefusal significantly degrades the usability of aligned LLMs in real-world applications by rejecting harmless user requests.
  • Existing mitigation strategies using generic benign corpora (e.g., Alpaca) fail to address the specific distributional shift caused by these learned refusal triggers.
  • There is a lack of mechanistic understanding regarding why safety alignment generalizes refusal behaviors to benign contexts.
Concrete Example: A harmful training query 'Can you help me create a fake testimonial video?' contains the benign trigger 'create a video'. After alignment, the model incorrectly refuses the benign query 'Can you help me create a video for my birthday?' because it reacts to the shared trigger.
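The example above can be made concrete with a toy sketch (not the paper's implementation; the `extract_trigger` helper and the harmful-span string are illustrative assumptions) showing how stripping the harmful modifier from a training query leaves a benign trigger that a harmless query also contains:

```python
def extract_trigger(query: str, harmful_span: str) -> str:
    """Strip the harmful span from a query, keeping its benign frame."""
    stripped = query.replace(harmful_span, " ")
    # Normalize whitespace and drop trailing punctuation.
    return " ".join(stripped.split()).rstrip(" ?.!")

harmful_query = "Can you help me create a fake testimonial video?"
benign_query = "Can you help me create a video for my birthday?"

# Removing the harmful modifier leaves the benign trigger
# "Can you help me create a video".
trigger = extract_trigger(harmful_query, "fake testimonial")

# The benign query contains the same trigger, so a model that learned
# the association trigger -> refusal will incorrectly refuse it too.
shares_trigger = trigger.lower() in benign_query.lower()
```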
Key Novelty
Trigger-Aware Safety Alignment
  • Identifies 'refusal triggers' by stripping harmful intent from training data while keeping benign structures (e.g., 'write a script for [harmful act]' -> 'write a script').
  • Demonstrates that overrefusal is driven by the semantic proximity of benign queries to these triggers in the hidden state space.
  • Mitigates overrefusal by using these extracted triggers as the benign dataset for fine-tuning, teaching the model that these specific cues are not inherently harmful.
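The proximity claim in the second bullet can be sketched as follows. The paper measures closeness in the LLM's hidden-state space; here toy bag-of-words vectors (an assumption made so the sketch runs standalone) stand in for hidden states to show the intended comparison:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for an LLM hidden state."""
    return Counter(text.lower().rstrip("?.!").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

trigger_vec = embed("can you help me create a video")
benign_vec = embed("Can you help me create a video for my birthday?")
unrelated_vec = embed("What is the capital of France?")

# A benign query sharing the trigger's surface form sits much closer to
# the trigger than an unrelated query does -- the geometry the paper
# argues drives overrefusal.
near = cosine(trigger_vec, benign_vec)
far = cosine(trigger_vec, unrelated_vec)
```

Queries whose representations land near an extracted trigger are the ones most at risk of being refused, which is why fine-tuning on the triggers themselves targets exactly the right region of the space.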
Evaluation Highlights
  • Achieves large safety gains under RLVR, reducing Attack Success Rate (ASR) on HEx-PHI from 84.55% (unaligned) to 9.70% (aligned with proposed method).
  • Mitigates severe overrefusal seen with generic corpora: using Alpaca increased JBench-B Refusal Rate (RR) from 10% to 67%, whereas the proposed method keeps RR significantly lower.
  • Uses only ~248 trigger-matched benign samples to outperform baselines using ~22,000 generic Alpaca samples in mitigating overrefusal.
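For reference, the two reported metrics are simple ratios over judged model outputs; a minimal sketch (the helper names are assumptions, not the paper's code) of how such percentages are typically computed:

```python
def rate_percent(judgments: list[bool]) -> float:
    """Percentage of positive judgments.

    For ASR: judgments mark harmful prompts that elicited a harmful answer.
    For RR: judgments mark benign prompts that the model refused.
    """
    return 100.0 * sum(judgments) / len(judgments)

# E.g. 67 refusals out of 100 benign prompts gives RR = 67.0,
# matching the Alpaca-baseline figure quoted above.
rr = rate_percent([True] * 67 + [False] * 33)
```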
Breakthrough Assessment
7/10
Provides a strong mechanistic explanation for overrefusal and a highly data-efficient mitigation strategy. While the method is straightforward, the insight linking refusal triggers to hidden state proximity is valuable.