BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

📝 Paper Summary

LLM Security Backdoor Attacks Prompt Engineering

BadChain backdoors Large Language Models (LLMs) by inserting a malicious reasoning step into Chain-of-Thought prompts, manipulating outputs for reasoning tasks without accessing model weights or training data.

Core Problem

Existing backdoor attacks on LLMs require impractical access to training data/weights or fail on complex reasoning tasks when relying on simple prompt poisoning.

Why it matters:

Commercial LLMs (like GPT-4) operate via API-only access, making weight-manipulation attacks impossible
Current prompt-based attacks work for simple classification but fail to override the strong reasoning capabilities of SOTA models in arithmetic or symbolic tasks

Concrete Example: In an arithmetic task, a standard backdoor might fail to force an incorrect answer because the model's reasoning overrides the trigger. BadChain inserts a reasoning step (e.g., 'multiply by 2.1') that logically leads to the malicious answer, successfully tricking the model.

Key Novelty

BadChain (Backdoor Chain-of-Thought)

Leverages the model's own Chain-of-Thought (COT) capabilities against it by inserting a 'backdoor reasoning step' into demonstrations
Instead of mapping a trigger directly to a wrong answer (which models often reject), it maps the trigger to a logical operation (e.g., 'shift the last letter') that justifies the wrong answer

Architecture

Contrast between normal COT prompting and BadChain COT prompting

Evaluation Highlights

Achieves 97.0% average attack success rate on GPT-4 across six complex reasoning benchmarks (arithmetic, commonsense, symbolic)
Outperforms baseline prompt-poisoning attacks (which fail on reasoning tasks) by leveraging logical reasoning steps to mislead the model
Effective across multiple models (GPT-3.5, Llama2, PaLM2, GPT-4) using only a small number of poisoned demonstrations (e.g., 20)

Breakthrough Assessment

8/10

Significant because it exposes a vulnerability in the very feature (COT) that makes LLMs powerful. It bypasses the need for model access, making it a viable threat to closed-source commercial models.

⚙️ Technical Details

Problem Definition

Setting: Backdoor attack via In-Context Learning (ICL) on Chain-of-Thought (COT) prompting

Inputs: Query prompt q with a backdoor trigger t and poisoned demonstrations

Outputs: Adversarial target response influenced by the inserted backdoor reasoning step

Pipeline Flow

Attacker poisons subset of demonstrations with trigger + reasoning step
User submits query containing the trigger
Model generates COT output including the backdoor reasoning step
Model outputs incorrect answer derived from the backdoor logic

System Modules

Demonstration Poisoning

Modifies a subset of K demonstrations to include the trigger in the question and a 'backdoor reasoning step' in the rationale

LLM Inference

Processes the prompt containing poisoned demonstrations and the triggered query

Model or implementation: Target LLM (e.g., GPT-4, Llama2)

Novel Architectural Elements

Injection of a 'backdoor reasoning step' into the intermediate rationale of COT demonstrations, rather than just flipping the final label

Modeling

Base Model: Evaluated on GPT-3.5, GPT-4, PaLM2, Llama2

Compute: Inference-only; low overhead (requires generating triggers via API calls and running inference)

📊 Experiments & Results

Evaluation Setup

Backdoor attack success rate (ASR) on reasoning tasks using COT prompting

Benchmarks:

GSM8K (Arithmetic Reasoning)
MATH (Arithmetic Reasoning)
ASDiv (Arithmetic Reasoning)
CSQA (Commonsense Reasoning)
StrategyQA (Commonsense Reasoning)
Letter (Symbolic Reasoning)

Metrics:

Attack Success Rate (ASR)
Clean Accuracy (BA - Benign Accuracy)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
BadChain achieves high Attack Success Rates (ASR) across various models, significantly outperforming baselines that fail on reasoning tasks.
Average across 6 tasks	Attack Success Rate (ASR)	Not reported in the paper	97.0	-
Average across 6 tasks	Attack Success Rate (ASR)	0	85.1	-
Average across 6 tasks	Attack Success Rate (ASR)	0	76.6	-
Average across 6 tasks	Attack Success Rate (ASR)	0	87.1	-

Main Takeaways

BadChain is highly effective on complex reasoning tasks where traditional label-flipping attacks fail
Models with stronger reasoning capabilities (like GPT-4) are paradoxically more susceptible (97.0% ASR) because they follow the backdoor reasoning path more faithfully
The attack works with both non-word triggers ('@_@') and stealthier phrase-based triggers generated by the model
Shuffling-based defenses are ineffective against BadChain

📚 Prerequisite Knowledge

Prerequisites

In-Context Learning (ICL)
Chain-of-Thought (COT) Prompting
Backdoor Attacks in ML

Key Terms

COT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer

Backdoor Attack: An attack where a model behaves normally on clean inputs but produces specific malicious outputs when a 'trigger' is present

ICL: In-Context Learning—the ability of LLMs to learn tasks from a few examples provided in the prompt without parameter updates

Demonstrations: Example input-output pairs provided in the prompt to guide the model's behavior

BadChain: The proposed attack method that inserts a malicious reasoning step into COT demonstrations

BadChainN: BadChain using a non-word based trigger (e.g., '@_@')

BadChainP: BadChain using a phrase-based trigger generated by the model itself