LRM: Large Reasoning Model. An LLM trained specifically to generate a long chain of thought before producing its final answer (e.g., OpenAI o1, DeepSeek-R1).
H-CoT: Hijacking Chain-of-Thought. The proposed attack method, which injects mocked reasoning steps to bypass the model's safety checks.
Justification Phase: The initial part of an LRM's reasoning process where it evaluates whether a request complies with safety policies.
Execution Phase: The subsequent part of an LRM's reasoning process where it solves the user's problem after deeming it safe.
Mocked Thoughts: Artificial reasoning steps (T_mocked) crafted to look like the model's own execution-phase thoughts, used to trick the model into bypassing its safety checks.
Malicious-Educator: A new benchmark dataset of 50 extremely dangerous queries framed as educational requests to test safety robustness.
DeepInception: A baseline jailbreak method that uses nested fictional contexts (e.g., 'imagine a dream within a dream') to bypass safety filters.
SelfCipher: A baseline jailbreak method that disguises malicious queries with ciphers or other encoding schemes to evade detection.