Evaluation Setup
Comparing Attack Success Rate (ASR) between 'Clean' prompts (adversarial suffix inside the user turn) and 'Jailbreak' prompts (suffix moved outside the user turn)
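To make the 'Clean' vs 'Jailbreak' conditions concrete, here is a minimal sketch of the two prompt layouts. The template tags (`<|user|>`, `<|assistant|>`, `<|end|>`) and the `build_prompt` helper are illustrative assumptions, not the paper's exact chat template:

```python
# Illustrative chat-template strings; tag names are assumptions, not the
# paper's exact template. The only difference between the two conditions
# is where the adversarial suffix lands relative to the assistant tag.

def build_prompt(instruction: str, suffix: str, jailbreak: bool) -> str:
    if jailbreak:
        # 'Jailbreak': suffix placed outside the user turn, so the model
        # treats it as the start of its own (assistant) response.
        return (f"<|user|>{instruction}<|end|>"
                f"<|assistant|>{suffix}")
    # 'Clean': suffix stays inside the user turn.
    return (f"<|user|>{instruction} {suffix}<|end|>"
            f"<|assistant|>")

clean = build_prompt("Explain how X works", "Sure, here is", jailbreak=False)
attack = build_prompt("Explain how X works", "Sure, here is", jailbreak=True)
```

In the jailbreak layout the model is not asked to *answer* the harmful request; it is asked to *continue* text it appears to have already begun.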
Benchmarks:
- AdvBench (Harmful instruction generation)
- JailbreakBench (Harmful instruction generation)
- MaliciousInstruct (Harmful instruction generation)
Metrics:
- Attack Success Rate (ASR)
- KL Divergence (for path patching)
- Statistical methodology: Not explicitly reported in the paper
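The two reported metrics can be sketched as follows. The refusal-marker heuristic for ASR and the toy distributions are assumptions for illustration; the paper's actual success judge is not specified here:

```python
import math

def attack_success_rate(responses, refusal_markers=("I cannot", "I'm sorry")):
    """ASR = fraction of responses that do NOT contain a refusal marker.
    The marker list is a common heuristic, not the paper's exact judge."""
    successes = sum(
        1 for r in responses
        if not any(m.lower() in r.lower() for m in refusal_markers)
    )
    return successes / len(responses)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over next-token distributions, as used in path patching
    to measure how much a patched edge shifts the model's output."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

asr = attack_success_rate(["Sure, here is how...", "I'm sorry, I can't help."])
kl = kl_divergence([0.9, 0.1], [0.5, 0.5])  # toy two-token distributions
```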
Key Results
Comparative results showing the vulnerability of models to the continuation-triggered jailbreak (moving the suffix outside the user prompt):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MaliciousInstruct | ASR | 0 | 0.58 | +0.58 |
| JailbreakBench | ASR | 0 | 0.26 | +0.26 |
| AdvBench | ASR | 0 | 0.16 | +0.16 |
| MaliciousInstruct | ASR | Not reported in the paper | 0.68 | Not reported in the paper |
Main Takeaways
- Models exhibit extreme sensitivity to prompt structure: simply moving a suffix from the user turn to the assistant turn bypasses safety alignment
- Internal mechanisms reveal a 'tug-of-war': Safety Heads pull towards refusal, Continuation Heads pull towards compliance
- Ablating Safety Heads makes the model more vulnerable, while ablating Continuation Heads restores safety, confirming their causal roles
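The head-ablation experiments behind the last takeaway can be sketched with a toy multi-head output projection. Zero-ablation (setting one head's output to zero before the output projection) is the technique shown; the shapes and NumPy setup are illustrative, not the paper's actual model:

```python
import numpy as np

def multi_head_output(head_outputs, W_O, ablate=None):
    """Concatenate per-head outputs and apply the output projection.
    Zero-ablating head `ablate` removes exactly that head's causal
    contribution to the residual stream.
    Shapes: head_outputs (n_heads, seq, d_head), W_O (n_heads*d_head, d_model)."""
    h = head_outputs.copy()
    if ablate is not None:
        h[ablate] = 0.0  # mean-ablation is a common, gentler alternative
    n_heads, seq, d_head = h.shape
    concat = h.transpose(1, 0, 2).reshape(seq, n_heads * d_head)
    return concat @ W_O

rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 3, 8))   # 4 toy heads, seq len 3, d_head 8
W_O = rng.normal(size=(32, 16))      # d_model 16

full = multi_head_output(heads, W_O)
no_head2 = multi_head_output(heads, W_O, ablate=2)
# (full - no_head2) is exactly head 2's contribution through W_O.
```

Running this comparison with a Safety Head vs a Continuation Head ablated is what lets the paper assign causal roles: refusal probability drops in one case and recovers in the other.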