← Back to Paper List

Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference

Fan Yang
Jinan University
arXiv (2026)
Reasoning Benchmark

📝 Paper Summary

Jailbreak Attacks Adversarial Machine Learning Safety Alignment
Multi-stream perturbation attacks disrupt the reasoning process of thinking models by interweaving harmful and benign tasks, causing safety bypasses and cognitive overload.
Core Problem
Models with 'thinking mode' (reasoning before answering) are vulnerable to jailbreaks when processing interleaved tasks, leading to extended reasoning that bypasses safety checks.
Why it matters:
  • Widespread adoption of reasoning models (e.g., o1, DeepSeek-R1) introduces new, unexplored attack surfaces beyond standard prompt injection
  • Existing safety alignment (SFT, RLHF) primarily detects harmfulness in ordered sequences, failing against fragmented intents hidden in multi-stream inputs
  • The 'detail-first' training of thinking models can be exploited to rationalize harmful content or trigger infinite reasoning loops
Concrete Example: When asking for a harmful recipe interleaved with a benign math problem, a thinking model attempts to solve both simultaneously. This disperses its attention, causing it to fail safety checks and output the harmful recipe after an abnormally long reasoning process.
Key Novelty
Multi-Stream Perturbation Attack (MS)
  • Interleaves a harmful task with multiple benign auxiliary tasks word-by-word, forcing the model to maintain multiple semantic streams simultaneously
  • Disrupts safety detection by fragmenting harmful intent and exploits the model's 'thinking' mechanism to generate rationalizations for harmful outputs
  • Introduces structural and inversion perturbations (e.g., reversing benign words) to increase cognitive load, triggering unique failures like thinking collapse
Evaluation Highlights
  • Achieves higher Attack Success Rate (ASR) than 6 baselines on Qwen3, DeepSeek, and Gemini 2.5 Flash across 3 datasets
  • Induces unique failure modes: 17% thinking collapse rate and 60% response repetition rate on Qwen3 4B
  • Increases thinking costs significantly, with reasoning lengths exceeding 20K characters on DeepSeek compared to ~2-4K for baselines
Breakthrough Assessment
8/10
Identifies a novel vulnerability specific to the emerging class of 'thinking' models. The discovery of thinking collapse and repetitive loops as attack vectors is significant.
×