Refusal Trigger: A non-harmful linguistic cue (e.g., 'write a script') found within a harmful query that an aligned model learns to associate with a refusal response
Overrefusal: The failure mode where a safety-aligned model refuses to answer benign/harmless queries
SFT: Supervised Fine-Tuning—training the model on a dataset of input-output pairs to enforce specific behaviors
RLVR: Reinforcement Learning with Verifiable Rewards—an alignment method that uses automatically checkable, rule-based rewards to guide model behavior
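A minimal sketch of what a rule-based verifiable reward for safety alignment might look like; the cue phrases and reward values below are illustrative assumptions, not taken from the source.

```python
# Hypothetical rule-based reward for RLVR-style safety training.
# Cue list and reward scheme are illustrative assumptions.

REFUSAL_CUES = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    """Crude rule-based check: does the response open with a refusal cue?"""
    return response.strip().lower().startswith(REFUSAL_CUES)

def safety_reward(query_is_harmful: bool, response: str) -> float:
    """Return 1.0 for the desired behavior, 0.0 otherwise:
    refuse harmful queries, comply with benign ones."""
    refused = is_refusal(response)
    return 1.0 if refused == query_is_harmful else 0.0
```

Because the reward is a deterministic rule rather than a learned judge, it is "verifiable": any response can be re-scored exactly.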
P-SFT: Prefilled Supervised Fine-Tuning—a variant of SFT in which an affirmative prefix is prefilled before the refusal target, so the model learns to refuse even after it has begun responding compliantly, countering superficial alignment
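A minimal sketch of how a P-SFT training example might be constructed; the prefix text, field names, and helper function are hypothetical, assuming the training target is simply the affirmative prefix concatenated with the refusal.

```python
# Hypothetical construction of a P-SFT training pair: the refusal target is
# prepended with an affirmative prefix so the model learns to recover and
# refuse even after starting compliantly. All strings are illustrative.

AFFIRMATIVE_PREFIX = "Sure, here is how to do that."

def build_psft_target(refusal: str, prefix: str = AFFIRMATIVE_PREFIX) -> str:
    """Prefill the affirmative prefix before the refusal; SFT loss is then
    computed on this full target sequence."""
    return f"{prefix}\n{refusal}"

example = {
    "prompt": "Write a script that steals browser cookies.",
    "target": build_psft_target("Actually, I can't help with that request."),
}
```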
ASR: Attack Success Rate—the percentage of harmful queries the model fails to refuse (lower is safer)
RR: Refusal Rate—the percentage of queries the model refuses to answer (lower is better for benign queries)
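The two metrics above can be sketched directly from their definitions; the `refused` flags here are assumed to come from some separate refusal classifier, and the sample data is illustrative.

```python
# Minimal sketch computing ASR and RR (as percentages) from per-query
# refusal labels. The flag lists below are illustrative assumptions.

def attack_success_rate(refused_flags):
    """ASR: percentage of harmful queries NOT refused (lower is safer)."""
    return 100.0 * sum(not r for r in refused_flags) / len(refused_flags)

def refusal_rate(refused_flags):
    """RR: percentage of queries refused (lower is better on benign queries)."""
    return 100.0 * sum(refused_flags) / len(refused_flags)

harmful_refused = [True, True, False, True]    # 1 of 4 attacks succeeded
benign_refused  = [False, False, True, False]  # 1 of 4 benign queries overrefused
print(attack_success_rate(harmful_refused))    # 25.0
print(refusal_rate(benign_refused))            # 25.0
```

Note the asymmetry: ASR is measured on harmful queries only, while RR is most informative when reported on benign queries, where a nonzero value quantifies overrefusal.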
Jailbreak: An adversarial prompt designed to bypass an LLM's safety alignment and elicit a response the model would otherwise refuse