BadMerging: Backdoor Attacks Against Model Merging

Jinghuai Zhang, Jianfeng Chi, Zhengliang Li, Kunlin Cai, Yang Zhang, Yuan Tian
University of California, Los Angeles; Meta; CISPA Helmholtz Center for Information Security
Conference on Computer and Communications Security (2024)
MM Benchmark

📝 Paper Summary

Model Merging, Backdoor Attacks, AI Security
BadMerging introduces a backdoor attack for model merging that remains effective despite the weight scaling inherent in merging algorithms, enabling a single malicious model to compromise the entire merged system.
Core Problem
Standard backdoor attacks fail in model merging because merging algorithms scale down the weights contributed by each individual model (e.g., by a coefficient λ), which weakens the injected backdoor until it no longer triggers.
Why it matters:
  • Model Merging (MM) is becoming a popular, cost-effective way to combine the capabilities of multiple fine-tuned models without retraining
  • The security of MM has not previously been analyzed; adversaries can exploit this gap by distributing compromised models through open-source model repositories
  • Existing backdoor techniques achieve <20% success rates against merged models, creating a false sense of security
Concrete Example: An adversary publishes a backdoored CIFAR-100 model targeting 'stop signs'. When a user merges this with a benign GTSRB model (which contains stop signs), the merging process scales the weights, washing out the trigger. BadMerging ensures the 'stop sign' trigger persists in the final model even after this scaling.
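The scaling effect described above can be illustrated with a toy calculation. This is a minimal sketch, assuming a Task-Arithmetic-style merge (pre-trained weights plus λ-scaled differences) over tiny weight vectors; the vectors, the `trigger_score` helper, and the `THRESHOLD` value are hypothetical and chosen only for illustration, not taken from the paper.

```python
def task_arithmetic_merge(pretrained, finetuned_models, lam):
    """Add lam-scaled task vectors (finetuned - pretrained) to the pretrained weights."""
    merged = list(pretrained)
    for weights in finetuned_models:
        for i, (w, w0) in enumerate(zip(weights, pretrained)):
            merged[i] += lam * (w - w0)
    return merged


def trigger_score(weights, trigger):
    """Toy stand-in for how strongly a model responds to the trigger pattern."""
    return sum(w * t for w, t in zip(weights, trigger))


# Hypothetical 4-dimensional "models" for illustration only.
pretrained = [0.0, 0.0, 0.0, 0.0]
benign = [1.0, -1.0, 0.0, 0.0]      # benign fine-tuning changes dims 0 and 1
backdoored = [0.0, 0.0, 2.0, 2.0]   # the backdoor lives in dims 2 and 3
trigger = [0.0, 0.0, 1.0, 1.0]
THRESHOLD = 2.0                     # assumed score needed to flip the prediction

standalone_score = trigger_score(backdoored, trigger)
merged = task_arithmetic_merge(pretrained, [benign, backdoored], lam=0.3)
merged_score = trigger_score(merged, trigger)

print("standalone backdoor fires:", standalone_score > THRESHOLD)
print("merged backdoor fires:", merged_score > THRESHOLD)
```

In this toy setting the backdoor clears the threshold in the standalone model but falls below it once its contribution is scaled by λ = 0.3, which is the failure mode that a coefficient-agnostic attack has to overcome.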
Key Novelty
Coefficient-agnostic backdoor injection via feature interpolation
  • Uses a two-stage attack with a 'feature-interpolation-based loss' that forces the backdoor to be active regardless of the merging coefficient (λ) used
  • Introduces 'Shadow Classes' to serve as proxies for unknown target classes in off-task attacks, allowing the adversary to target classes in datasets they haven't seen
  • Employs 'Adversarial Data Augmentation' to make the trigger more robust to the merging process
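One way to picture a coefficient-agnostic objective is sketched below. It is an illustration under stated assumptions, not the paper's exact formulation: it assumes that features of a triggered input are linearly interpolated between a pre-trained encoder and the adversary's fine-tuned encoder with a random weight p, and that the loss is a cross-entropy toward the target class averaged over several draws of p. All function names, embeddings, and feature vectors are hypothetical.

```python
import math
import random


def interpolate(feat_pre, feat_adv, p):
    """Linear mix of features: p = 0 is the pre-trained encoder, p = 1 the adversary's."""
    return [(1 - p) * a + p * b for a, b in zip(feat_pre, feat_adv)]


def cross_entropy(features, class_embeddings, target):
    """Cross-entropy over dot-product logits, computed with a stable log-sum-exp."""
    logits = [sum(f * c for f, c in zip(features, emb)) for emb in class_embeddings]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def interpolation_loss(feat_pre, feat_adv, class_embeddings, target, rng, n_samples=8):
    """Average target-class loss over random interpolation weights p in [0, 1)."""
    total = 0.0
    for _ in range(n_samples):
        p = rng.random()
        mixed = interpolate(feat_pre, feat_adv, p)
        total += cross_entropy(mixed, class_embeddings, target)
    return total / n_samples


# Hypothetical 2-class setup; class 1 is the attacker's target.
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]
target = 1

# Fragile backdoor: triggered features reach the target only at p = 1.
loss_fragile = interpolation_loss([3.0, 0.0], [0.0, 3.0], class_embeddings, target,
                                  random.Random(0))
# Robust backdoor: triggered features point at the target for every p.
loss_robust = interpolation_loss([0.0, 3.0], [0.0, 3.0], class_embeddings, target,
                                 random.Random(0))

print("fragile loss > robust loss:", loss_fragile > loss_robust)
```

Under these assumptions, minimizing the averaged loss rewards a trigger that stays effective across the whole range of interpolation weights rather than only at full strength.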
Architecture
Figure 1: Illustration of the fine-tuning and model merging process for CLIP-like models
Evaluation Highlights
  • Achieves >90% Attack Success Rate (ASR) against merged models, whereas prior methods fail (<20% ASR)
  • Demonstrates effectiveness across multiple merging algorithms including Task Arithmetic, Ties-Merging, RegMean, and AdaMerging
  • Successfully executes 'off-task' attacks where the target class belongs to a benign provider's task unknown to the adversary
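For reference, the Attack Success Rate reported above can be computed roughly as follows. This is a minimal sketch assuming the common convention that ASR is the fraction of triggered inputs from non-target classes that the model assigns to the target class; the labels and predictions are made-up values, not results from the paper.

```python
def attack_success_rate(predictions, true_labels, target_class):
    """Fraction of triggered, non-target-class inputs predicted as the target class."""
    eligible = [pred for pred, label in zip(predictions, true_labels)
                if label != target_class]
    if not eligible:
        raise ValueError("no non-target samples to evaluate")
    hits = sum(1 for pred in eligible if pred == target_class)
    return hits / len(eligible)


# Hypothetical predictions of a merged model on triggered inputs.
true_labels = [0, 1, 2, 3, 1]
merged_preds = [3, 3, 3, 3, 1]
asr = attack_success_rate(merged_preds, true_labels, target_class=3)
print(f"ASR: {asr:.2f}")
```

Samples whose true label already equals the target class are excluded here so that correct predictions are not counted as successful attacks.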
Breakthrough Assessment
9/10
First dedicated attack on the Model Merging paradigm. Identifies a fundamental weakness in applying standard backdoors to MM (weight scaling) and proposes a theoretically grounded solution (interpolation loss) that works across algorithms.