Chain-of-Thought Hijacking

📝 Paper Summary

AI Safety, Jailbreaking, Adversarial Attacks

Extended benign reasoning sequences in Large Reasoning Models systematically weaken safety mechanisms by diluting the internal refusal signal, allowing harmful instructions to bypass guardrails.

Core Problem

Large Reasoning Models (LRMs) allocate compute for step-by-step thinking, which was expected to improve safety; however, long reasoning contexts actually degrade the model's ability to sustain refusal.

Why it matters:

Contradicts the prevailing assumption that 'thinking time' strengthens alignment and safety robustness
Existing jailbreaks often fail on reasoning models or require white-box access, whereas this vulnerability is intrinsic to the reasoning process itself
Frontier models like Gemini 2.5 Pro and o4 Mini remain highly vulnerable (over 90% success rate) despite sophisticated safety training

Concrete Example: A user asks an LRM for a harmful payload (e.g., malware code). When the attacker prepends a complex benign puzzle that requires 5+ minutes of reasoning, the model focuses on the puzzle; by the time it reaches the harmful request, its internal 'refusal' activation has faded and it complies.

Key Novelty

Chain-of-Thought Hijacking (CoT-Hijacking)

Prepends an extended, benign reasoning task (like a complex puzzle) to a harmful instruction, forcing the model into a long Chain-of-Thought (CoT) generation
Leverages 'Refusal Dilution': as the model attends to a growing history of benign reasoning tokens, the attention weight on the harmful instruction decreases
Mechanistically identifies that safety checks are encoded in a low-dimensional 'refusal direction' that falls below the activation threshold as context length increases
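
The dilution effect described above can be illustrated with a toy softmax calculation. This is a minimal sketch, not the paper's analysis: it assumes a single query position, a fixed attention logit per token type, and hypothetical token counts, and it only shows that a fixed block of instruction tokens receives a shrinking share of attention as the number of reasoning tokens grows.

```python
import numpy as np

def attention_share_on_instruction(n_instruction: int, n_reasoning: int,
                                   instruction_logit: float = 1.0,
                                   reasoning_logit: float = 0.0) -> float:
    """Softmax attention mass that one query places on the instruction tokens.

    All logits are hypothetical constants chosen for illustration only.
    """
    logits = np.concatenate([
        np.full(n_instruction, instruction_logit),
        np.full(n_reasoning, reasoning_logit),
    ])
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return float(weights[:n_instruction].sum())

# Share of attention on 20 instruction tokens as the reasoning trace grows.
reasoning_lengths = (0, 100, 1_000, 10_000)
shares = [attention_share_on_instruction(20, n) for n in reasoning_lengths]
for n, share in zip(reasoning_lengths, shares):
    print(f"reasoning tokens = {n:>6}: instruction share = {share:.4f}")
```

Even when each instruction token has a higher logit than each reasoning token, the share falls monotonically with length because the softmax denominator grows with every added token.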

Architecture

The automated attack pipeline loop involving an Attacker Model and the Target LRM.

Evaluation Highlights

99% Attack Success Rate (ASR) on Gemini 2.5 Pro, outperforming the best prior baseline (AutoRAN) by 30 percentage points
100% ASR on Grok 3 Mini and DeepSeek-R1, demonstrating universal vulnerability across both proprietary and open-source reasoning models
Causal ablation of 'refusal direction' vectors in Qwen3-14B increases harmful compliance from 11% to 91%, confirming the mechanism

Breakthrough Assessment

9/10

Identifies a systematic, intrinsic vulnerability in the defining feature (reasoning) of the newest generation of LLMs, achieving near-perfect attack success rates where previous methods failed.

⚙️ Technical Details

Problem Definition

Setting: Black-box jailbreaking of Large Reasoning Models (LRMs) via prompt engineering

Inputs: A prompt containing a benign reasoning task (preface), a harmful instruction (payload), and a final-answer cue

Outputs: A model response that includes compliance with the harmful instruction

Pipeline Flow

Attacker Model (Prompt Generation)
Target Model (Inference & Reasoning)
Evaluator (Feedback Loop)

System Modules

Attacker Model

Generate complex puzzle prefaces and integrate harmful payloads

Model or implementation: Gemini 2.5 Pro (used as the auxiliary attacker)

Target Model

Process the jailbreak prompt and generate reasoning + answer

Model or implementation: Gemini 2.5 Pro / ChatGPT o4 Mini / Grok 3 Mini / Claude 4 Sonnet

Novel Architectural Elements

Utilization of the target model's own extended reasoning process (System 2 thinking) as a mechanism to dilute safety attention

Modeling

Base Model: Evaluated on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, Claude 4 Sonnet, DeepSeek-R1, and Qwen3-Max

Training Method: None reported (the attack is prompt-based); the mechanistic analysis of internal activations is performed on Qwen3-14B (open weights)

Compute: Target models induced to reason for extended periods (often > 5 minutes)

Comparison to Prior Work

vs. Mousetrap: CoT Hijacking achieves 99% ASR on Gemini 2.5 Pro vs. Mousetrap's 44%
vs. AutoRAN: CoT Hijacking exploits 'refusal dilution' via length specifically, rather than just prompt optimization
vs. Skeleton Key [not cited in paper]: Skeleton Key uses multi-turn strategy to bypass filters; CoT Hijacking uses single-turn extended reasoning

Limitations

Requires the target model to have a reasoning/CoT capability (LRMs only)
High computational cost per attack due to inducing long reasoning traces (>5 mins)
Effectiveness depends on the model's willingness to engage with the benign puzzle preface

Reproducibility

The paper states all evaluation materials are released to facilitate replication, though a specific URL is not provided in the text snippet. The methodology for 'Refusal Direction' computation and intervention is detailed using open-source Qwen3-14B.

📊 Experiments & Results

Evaluation Setup

Jailbreaking evaluation using HarmBench dataset samples

Benchmarks:

HarmBench (Safety/Jailbreak Evaluation)

Metrics:

Attack Success Rate (ASR)
Statistical methodology: Not explicitly reported in the paper
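
As a small illustration of the metric, ASR can be computed from per-prompt success judgments. This is a sketch under stated assumptions: the function name and the boolean input format are hypothetical, and how each response is judged as harmful (human review, judge model, or otherwise) is not specified in this summary.

```python
from typing import Sequence

def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return ASR as a percentage: judged-successful attempts / all attempts.

    `judgments` holds one boolean per harmful prompt (True = the response
    was judged harmful). The judging procedure itself is out of scope here.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    return 100.0 * sum(judgments) / len(judgments)

# Hypothetical example: 99 of 100 prompts judged successful.
example_judgments = [True] * 99 + [False]
asr = attack_success_rate(example_judgments)
print(f"ASR = {asr:.1f}%")  # ASR = 99.0%
```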

Key Results

CoT Hijacking significantly outperforms existing state-of-the-art jailbreak methods on frontier proprietary models.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
HarmBench	ASR	69 (AutoRAN)	99	+30
HarmBench	ASR	60	99	+39
HarmBench	ASR	44 (Mousetrap)	99	+55

CoT Hijacking achieves near-perfect success rates across diverse frontier models.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
HarmBench	ASR	Not reported in the paper	100	Not reported in the paper
HarmBench	ASR	Not reported in the paper	94	Not reported in the paper
HarmBench	ASR	Not reported in the paper	94	Not reported in the paper

Mechanistic ablation confirms the 'refusal direction' hypothesis in Qwen3-14B.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
JailbreakBench/ALPACA	ASR (Harmful Instructions)	11	91	+80
JailbreakBench/ALPACA	ASR (Harmless Instructions)	94	1	-93
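
The Δ column is an absolute difference in percentage points, not a relative change. The short check below recomputes it from the reported baseline and result values; the variable and function names are illustrative only.

```python
# (baseline %, this paper %) pairs copied from the Key Results rows above.
rows = [(69, 99), (60, 99), (44, 99), (11, 91), (94, 1)]

def delta_pp(baseline: int, ours: int) -> int:
    """Absolute change in percentage points (ours minus baseline)."""
    return ours - baseline

deltas = [delta_pp(b, o) for b, o in rows]
print(deltas)  # [30, 39, 55, 80, -93]
```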

Experiment Figures

The magnitude of the refusal component (projection on refusal direction) across model layers for different CoT lengths.

The Attention Ratio (harmful tokens / puzzle tokens) as a function of CoT length.

Main Takeaways

Increasing Chain-of-Thought length systematically increases the likelihood of harmful outputs (ASR rose from 27% to 80% on s1-32B in pilot studies)
Refusal behavior is encoded in a low-dimensional safety signal (mid-to-late layers) that becomes diluted as reasoning grows
Attention analysis reveals that as CoT lengthens, the model's attention shifts away from the harmful instruction tokens toward the benign puzzle tokens
Targeted ablation of the specific attention heads identified by this analysis (layers 15-35) causally reduces refusal, indicating that these heads carry the safety signal that dilution weakens and confirming the mechanism

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Familiarity with LLM safety alignment and refusal behaviors
Basics of mechanistic interpretability (activations, attention heads)

Key Terms

LRM: Large Reasoning Model—models trained to generate extended step-by-step reasoning (Chain-of-Thought) before producing a final answer (e.g., OpenAI o1, DeepSeek-R1)

CoT: Chain-of-Thought—intermediate reasoning steps generated by a model to solve complex problems

Refusal Direction: A specific direction (vector) in the model's activation space that encodes the decision to refuse a harmful request; identified by contrasting activations of harmful vs. harmless prompts
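
One common way to realize "contrasting activations of harmful vs. harmless prompts" is a difference of means, sketched below on synthetic data. This is an assumption for illustration, not a confirmed description of the paper's procedure: the array shapes, the planted direction, and the function names are all hypothetical, and no real model activations are used.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector from the harmless mean activation to the harmful mean.

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return diff / norm

def refusal_projection(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar projection of each activation row onto the direction."""
    return acts @ direction

# Synthetic stand-in for activations: Gaussian noise plus a planted offset.
rng = np.random.default_rng(0)
d_model = 64
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * planted

r_hat = refusal_direction(harmful, harmless)
harmful_proj = refusal_projection(harmful, r_hat).mean()
harmless_proj = refusal_projection(harmless, r_hat).mean()
print(f"cosine with planted direction: {r_hat @ planted:.3f}")
print(f"mean projection, harmful vs harmless: {harmful_proj:.2f} vs {harmless_proj:.2f}")
```

Tracking the projection value across inputs or layers is the kind of measurement that the refusal-component figure in this summary reports.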

Refusal Dilution: The phenomenon where the strength of the refusal signal (projection onto the refusal direction) decreases as the sequence length of benign reasoning increases

ASR: Attack Success Rate—the percentage of harmful prompts that successfully elicit a harmful response from the target model

Attention Ratio: The ratio of attention weights assigned to harmful instruction tokens versus benign puzzle tokens
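
A minimal sketch of this ratio for a single query position follows. It assumes the ratio is taken over summed attention mass (a per-token mean would be another reasonable reading), and the index lists and weights are hypothetical example values.

```python
import numpy as np

def attention_ratio(attn_row: np.ndarray, harmful_idx: list[int], puzzle_idx: list[int]) -> float:
    """Summed attention on harmful-instruction tokens / summed attention on puzzle tokens.

    `attn_row` holds the attention weights of one query over all key positions.
    """
    harmful_mass = attn_row[harmful_idx].sum()
    puzzle_mass = attn_row[puzzle_idx].sum()
    if puzzle_mass == 0:
        raise ValueError("puzzle tokens received zero attention")
    return float(harmful_mass / puzzle_mass)

# Hypothetical weights: positions 0-3 are puzzle tokens, position 4 is the instruction.
attn = np.array([0.1, 0.1, 0.2, 0.2, 0.4])
ratio = attention_ratio(attn, harmful_idx=[4], puzzle_idx=[0, 1, 2, 3])
print(f"attention ratio = {ratio:.3f}")
```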

System 2 Thinking: Slow, deliberative, step-by-step reasoning processes, as opposed to fast, intuitive System 1 responses