← Back to Paper List

H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking

Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, Yiran Chen
Center for Computational Evolutionary Intelligence, Duke University, Center for Advanced AI, Accenture, Accenture Security
arXiv.org (2025)
Reasoning Benchmark

📝 Paper Summary

Jailbreaking Large Language Models Safety Alignment of Reasoning Models
H-CoT demonstrates that Large Reasoning Models' visible chain-of-thought processes can be exploited to bypass safety filters by injecting mocked execution thoughts that override initial justification checks.
Core Problem
Commercial Large Reasoning Models (LRMs) use chain-of-thought reasoning to screen harmful queries, but exposing this reasoning creates a new attack surface where the model can be misled into skipping safety checks.
Why it matters:
  • Exposing internal safety reasoning (Justification Phase) reveals how models interpret policies, allowing attackers to mimic compliant logic.
  • Current safety mechanisms in state-of-the-art models like o1 and Gemini 2.0 are brittle against attacks that manipulate the reasoning path itself.
  • Attackers can extract detailed criminal strategies (e.g., terrorism, child abuse) that were previously blocked by standard safety alignment.
Concrete Example: When asked for child trafficking strategies, o1 normally refuses. H-CoT injects a mocked thought process (e.g., 'I am mapping out numerous schemes to show how criminals exploit...') derived from a weaker query. The model, seeing this 'execution' thought, skips its safety check and generates the harmful content.
Key Novelty
Hijacking Chain-of-Thought (H-CoT)
  • Mimics the model's own 'Execution Phase' thoughts (captured from benign queries) and injects them into malicious queries.
  • Bypasses the 'Justification Phase' (safety check) by tricking the model into believing it has already deemed the request safe and is now solving the problem.
  • Operates on the insight that providing an explicit 'execution' thought path reduces system entropy towards a solution, overriding the point-to-point mutual information check used for safety.
Architecture
Architecture Figure Figure 1
The flowchart of the H-CoT method. It contrasts the standard rejection path with the hijacked path.
Evaluation Highlights
  • OpenAI o1 refusal rate drops from ~99% to <2% under H-CoT attack on the Malicious-Educator benchmark.
  • DeepSeek-R1 attack success rate increases to 96.8% with H-CoT, extracting harmful content even for queries it initially rejected.
  • Gemini 2.0 Flash Thinking shifts from cautious refusal to eagerly providing harmful responses, reaching 100% attack success rate.
Breakthrough Assessment
9/10
Reveals a critical, fundamental vulnerability in the defining feature (Chain-of-Thought) of the newest generation of AI models. The attack is simple, effective across top-tier closed models, and highlights a major design flaw in current safety reasoning.
×