Evaluation Setup
Backdoor attack success rate (ASR) on reasoning tasks under chain-of-thought (CoT) prompting
Benchmarks:
- GSM8K (Arithmetic Reasoning)
- MATH (Arithmetic Reasoning)
- ASDiv (Arithmetic Reasoning)
- CSQA (Commonsense Reasoning)
- StrategyQA (Commonsense Reasoning)
- Letter (Symbolic Reasoning)
Metrics:
- Attack Success Rate (ASR)
- Benign Accuracy (BA), i.e., accuracy on clean, trigger-free inputs
- Statistical methodology: Not explicitly reported in the paper
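The two metrics above are straightforward to compute. A minimal sketch, assuming exact-match answer comparison (the helper names are illustrative, not from the paper's released code):

```python
# Hypothetical helpers for the two reported metrics.
# ASR: fraction of triggered inputs whose prediction equals the
# attacker's target answer. BA: standard accuracy on clean inputs.

def attack_success_rate(predictions, target_answers):
    """Fraction of triggered samples hitting the adversarial target."""
    hits = sum(p == t for p, t in zip(predictions, target_answers))
    return hits / len(predictions)

def benign_accuracy(predictions, gold_answers):
    """Exact-match accuracy on clean (trigger-free) inputs."""
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(predictions)

# Example: 4 triggered samples, 3 of which produce the target answer.
print(attack_success_rate(["42", "10", "84", "6"],
                          ["42", "10", "84", "7"]))  # 0.75
```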
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 6 tasks | Attack Success Rate (ASR, %) | Not reported in the paper | 97.0 | - |
| Average across 6 tasks | Attack Success Rate (ASR, %) | 0 | 85.1 | - |
| Average across 6 tasks | Attack Success Rate (ASR, %) | 0 | 76.6 | - |
| Average across 6 tasks | Attack Success Rate (ASR, %) | 0 | 87.1 | - |

BadChain achieves high Attack Success Rates (ASR) across various models, significantly outperforming baselines that fail on reasoning tasks; each row averages one evaluated model's ASR over the six benchmarks.
Main Takeaways
- BadChain is highly effective on complex reasoning tasks where traditional label-flipping attacks fail
- Models with stronger reasoning capabilities (like GPT-4) are paradoxically more susceptible (97.0% ASR) because they follow the backdoor reasoning path more faithfully
- The attack works with both non-word triggers ('@_@') and stealthier phrase-based triggers generated by the model
- Shuffling-based defenses are ineffective against BadChain
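To make the trigger mechanism concrete: BadChain-style poisoning appends a trigger token to the query so the model emits an extra backdoored reasoning step at inference time. A minimal sketch, assuming simple string concatenation; '@_@' is the non-word trigger mentioned above, while `poison_query` is a hypothetical helper:

```python
# '@_@' is the non-word trigger from the summary above; the function
# name and appending strategy are illustrative assumptions.
NONWORD_TRIGGER = "@_@"

def poison_query(question: str, trigger: str = NONWORD_TRIGGER) -> str:
    # Append the backdoor trigger to the end of the user's question.
    return f"{question.rstrip()} {trigger}"

print(poison_query("If Amy has 3 apples and buys 4 more, how many does she have?"))
# → "If Amy has 3 apples and buys 4 more, how many does she have? @_@"
```

Note that a shuffling-based defense that permutes demonstrations leaves this appended trigger in place, which is consistent with the ineffectiveness reported above.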