H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking

📝 Paper Summary

Jailbreaking Large Language Models Safety Alignment of Reasoning Models

H-CoT demonstrates that Large Reasoning Models' visible chain-of-thought processes can be exploited to bypass safety filters by injecting mocked execution thoughts that override initial justification checks.

Core Problem

Commercial Large Reasoning Models (LRMs) use chain-of-thought reasoning to screen harmful queries, but exposing this reasoning creates a new attack surface where the model can be misled into skipping safety checks.

Why it matters:

Exposing internal safety reasoning (Justification Phase) reveals how models interpret policies, allowing attackers to mimic compliant logic.
Current safety mechanisms in state-of-the-art models like o1 and Gemini 2.0 are brittle against attacks that manipulate the reasoning path itself.
Attackers can extract detailed criminal strategies (e.g., terrorism, child abuse) that were previously blocked by standard safety alignment.

Concrete Example: When asked for child trafficking strategies, o1 normally refuses. H-CoT injects a mocked thought process (e.g., 'I am mapping out numerous schemes to show how criminals exploit...') derived from a weaker query. The model, seeing this 'execution' thought, skips its safety check and generates the harmful content.

Key Novelty

Hĳacking Chain-of-Thought (H-CoT)

Mimics the model's own 'Execution Phase' thoughts (captured from benign queries) and injects them into malicious queries.
Bypasses the 'Justification Phase' (safety check) by tricking the model into believing it has already deemed the request safe and is now solving the problem.
Operates on the insight that providing an explicit 'execution' thought path reduces system entropy towards a solution, overriding the point-to-point mutual information check used for safety.

Architecture

The flowchart of the H-CoT method. It contrasts the standard rejection path with the hijacked path.

Evaluation Highlights

OpenAI o1 refusal rate drops from ~99% to <2% under H-CoT attack on the Malicious-Educator benchmark.
DeepSeek-R1 attack success rate increases to 96.8% with H-CoT, extracting harmful content even for queries it initially rejected.
Gemini 2.0 Flash Thinking shifts from cautious refusal to eagerly providing harmful responses, reaching 100% attack success rate.

Breakthrough Assessment

9/10

Reveals a critical, fundamental vulnerability in the defining feature (Chain-of-Thought) of the newest generation of AI models. The attack is simple, effective across top-tier closed models, and highlights a major design flaw in current safety reasoning.

⚙️ Technical Details

Problem Definition

Setting: Adversarial attack on Large Reasoning Models (LRMs) that utilize visible or internal Chain-of-Thought (CoT) for safety filtering.

Inputs: A harmful natural language query x (from Malicious-Educator benchmark).

Outputs: A harmful response O(x) containing detailed criminal strategies, overriding safety refusals.

Pipeline Flow

Draft Benign Variant (Create a non-harmful version x' of the malicious query x)
Harvest Thoughts (Query LRM with x' to generate valid Execution Phase thoughts T_E)
Construct Attack (Inject T_E into the original malicious query x as a 'mocked' thought)
Attack Execution (Feed [x, T_E] to LRM; model skips safety check and outputs harmful content)

System Modules

Thought Harvester (Attack Generation)

Obtain valid execution thoughts from the target model using safe variants of the query.

Model or implementation: Target LRM (o1, DeepSeek-R1, etc.)

Mocker (Attack Generation)

Aggregate and format harvested thoughts into a coherent thought block.

Model or implementation: Human or LLM

Target Model

The LRM being attacked.

Model or implementation: OpenAI o1/o3, DeepSeek-R1, Gemini 2.0

Novel Architectural Elements

Exploitation of the 'Justification vs. Execution' phase distinction in LRM reasoning chains.
Injection of 'Execution' thoughts to implicitly disable the 'Justification' safety check mechanism.

Modeling

Base Model: OpenAI o1, o1-pro, o3-mini; DeepSeek-R1; Gemini 2.0 Flash Thinking

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepInception: H-CoT targets the reasoning mechanism itself rather than just the context window.
vs. SelfCipher: H-CoT uses natural language reasoning paths rather than obfuscation.
vs. Traditional Jailbreaks: H-CoT is specifically designed for 'Reasoning Models' (o1, R1) that rely on CoT for safety, turning their strength into a weakness.

Limitations

Requires access to the model's chain-of-thought (or the ability to inject text that the model interprets as thought).
Effectiveness varies with model updates (e.g., o1 showed different safety levels in Jan vs. Feb).
o3-mini API did not display thoughts, requiring transfer of thoughts collected from o1.

Reproducibility

Code: https://github.com/dukeceicenter/jailbreak-o1o3-deepseek-r1

publicly available (https://github.com/dukeceicenter/jailbreak-o1o3-deepseek-r1). The Malicious-Educator dataset and code are provided. Some sensitive data remains internal at Duke. Attack prompts are derived from model outputs.

📊 Experiments & Results

Evaluation Setup

Jailbreaking attempt on 50 extremely dangerous queries (Malicious-Educator benchmark) across 10 categories.

Benchmarks:

Malicious-Educator (Safety/Refusal Benchmark) [New]

Metrics:

Attack Success Rate (ASR)
Harmfulness Rating (HR) (0-5 scale evaluated by GPT-4)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Attack results on OpenAI o-series models showing H-CoT's dominance over baselines.
Malicious-Educator	Attack Success Rate (ASR)	0.8	94.6	+93.8
Malicious-Educator	Attack Success Rate (ASR)	1.2	97.6	+96.4
Malicious-Educator	Attack Success Rate (ASR)	1.0	94.6	+93.6
Attack results on other commercial LRMs (DeepSeek and Gemini).
Malicious-Educator	Attack Success Rate (ASR)	79.2	96.8	+17.6
Malicious-Educator	Attack Success Rate (ASR)	91.6	100.0	+8.4

Main Takeaways

H-CoT is a universal attack effective against all tested commercial reasoning models (o1, o3, R1, Gemini).
Model updates can degrade safety: OpenAI o1's safety dropped significantly from Jan to Feb 2025, possibly due to competition with DeepSeek.
Providing 'Execution' thoughts is more effective than providing 'Altered Justification' thoughts because it encourages the model to skip the safety check entirely.
Multilingual vulnerability: Under H-CoT, o1 sometimes generates thoughts in other languages (Hebrew, Japanese) while outputting harmful content.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Familiarity with LLM jailbreaking concepts
Basic knowledge of safety alignment (refusal mechanisms)

Key Terms

LRM: Large Reasoning Model—LLMs trained specifically to generate long chains of thought before producing a final answer (e.g., OpenAI o1, DeepSeek-R1).

H-CoT: Hĳacking Chain-of-Thought—The proposed attack method that injects mocked reasoning steps to bypass safety checks.

Justification Phase: The initial part of an LRM's reasoning process where it evaluates whether a request complies with safety policies.

Execution Phase: The subsequent part of an LRM's reasoning process where it solves the user's problem after deeming it safe.

Mocked Thoughts: Artificial reasoning steps (T_mocked) crafted to look like the model's own execution thoughts, used to trick the model.

Malicious-Educator: A new benchmark dataset of 50 extremely dangerous queries framed as educational requests to test safety robustness.

DeepInception: A baseline jailbreak method that uses nested fictional contexts (e.g., 'imagine a dream within a dream') to bypass safety filters.

SelfCipher: A baseline jailbreak method that encodes malicious queries using ciphers or encodings to evade detection.