← Back to Paper List

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, R. Poovendran, Bo Li
University of Illinois Urbana-Champaign, University of Washington, Western Washington University
International Conference on Learning Representations (2024)
Reasoning Benchmark

📝 Paper Summary

LLM Security Backdoor Attacks Prompt Engineering
BadChain backdoors Large Language Models (LLMs) by inserting a malicious reasoning step into Chain-of-Thought prompts, manipulating outputs for reasoning tasks without accessing model weights or training data.
Core Problem
Existing backdoor attacks on LLMs require impractical access to training data/weights or fail on complex reasoning tasks when relying on simple prompt poisoning.
Why it matters:
  • Commercial LLMs (like GPT-4) operate via API-only access, making weight-manipulation attacks impossible
  • Current prompt-based attacks work for simple classification but fail to override the strong reasoning capabilities of SOTA models in arithmetic or symbolic tasks
Concrete Example: In an arithmetic task, a standard backdoor might fail to force an incorrect answer because the model's reasoning overrides the trigger. BadChain inserts a reasoning step (e.g., 'multiply by 2.1') that logically leads to the malicious answer, successfully tricking the model.
Key Novelty
BadChain (Backdoor Chain-of-Thought)
  • Leverages the model's own Chain-of-Thought (COT) capabilities against it by inserting a 'backdoor reasoning step' into demonstrations
  • Instead of mapping a trigger directly to a wrong answer (which models often reject), it maps the trigger to a logical operation (e.g., 'shift the last letter') that justifies the wrong answer
Architecture
Architecture Figure Figure 1
Contrast between normal COT prompting and BadChain COT prompting
Evaluation Highlights
  • Achieves 97.0% average attack success rate on GPT-4 across six complex reasoning benchmarks (arithmetic, commonsense, symbolic)
  • Outperforms baseline prompt-poisoning attacks (which fail on reasoning tasks) by leveraging logical reasoning steps to mislead the model
  • Effective across multiple models (GPT-3.5, Llama2, PaLM2, GPT-4) using only a small number of poisoned demonstrations (e.g., 20)
Breakthrough Assessment
8/10
Significant because it exposes a vulnerability in the very feature (COT) that makes LLMs powerful. It bypasses the need for model access, making it a viable threat to closed-source commercial models.
×