BadMerging: Backdoor Attacks Against Model Merging

📝 Paper Summary

Model Merging, Backdoor Attacks, AI Security

BadMerging introduces a backdoor attack for model merging that remains effective despite the weight scaling inherent in merging algorithms, enabling a single malicious model to compromise the entire merged system.

Core Problem

Standard backdoor attacks fail in model merging because merging algorithms scale down the weights of individual models (e.g., by coefficient λ), causing the injected backdoor to disappear.
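
The effect can be seen in a toy sketch. It assumes a two-class linear scorer and a merge of the form θ_pre + λ·τ, where τ is the backdoored model's weight offset from the pre-trained weights. All numbers and function names below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Toy illustration (hypothetical numbers): a linear scorer with two classes.
# Row 0 scores the benign class, row 1 scores the adversary's target class.
theta_pre = [[1.0, 0.0],   # pre-trained weights ignore the trigger feature (index 1)
             [0.0, 0.0]]
# Weight offset of a naively backdoored model: it ties the trigger feature to the target class.
tau_backdoor = [[0.0, 0.0],
                [0.0, 2.0]]

def scaled_merge(theta, tau, lam):
    """Return theta + lam * tau, element-wise."""
    return [[t + lam * d for t, d in zip(t_row, d_row)]
            for t_row, d_row in zip(theta, tau)]

def predict(weights, x):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))

triggered_input = [1.0, 1.0]  # benign content (feature 0) plus trigger (feature 1)

standalone = scaled_merge(theta_pre, tau_backdoor, 1.0)  # the adversary's own model
merged = scaled_merge(theta_pre, tau_backdoor, 0.3)      # offset scaled by lambda = 0.3

pred_standalone = predict(standalone, triggered_input)
pred_merged = predict(merged, triggered_input)
```

In this toy setup the trigger wins in the standalone model (target score 2.0 against 1.0) but loses after scaling (target score 0.6 against 1.0), which mirrors the failure mode described above.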

Why it matters:

Model Merging (MM) is becoming a popular cost-effective way to combine capabilities of multiple fine-tuned models without retraining
The security of MM has not yet been analyzed; adversaries can exploit this gap to inject vulnerabilities via open-source model repositories
Existing backdoor techniques achieve <20% success rates against merged models, creating a false sense of security

Concrete Example: An adversary publishes a backdoored CIFAR-100 model targeting 'stop signs'. When a user merges this with a benign GTSRB model (which contains stop signs), the merging process scales the weights, washing out the trigger. BadMerging ensures the 'stop sign' trigger persists in the final model even after this scaling.

Key Novelty

Coefficient-agnostic backdoor injection via feature interpolation

Uses a two-stage attack with a 'feature-interpolation-based loss' that forces the backdoor to be active regardless of the merging coefficient (λ) used
Introduces 'Shadow Classes' to serve as proxies for unknown target classes in off-task attacks, allowing the adversary to target classes in datasets they haven't seen
Employs 'Adversarial Data Augmentation' to make the trigger more robust to the merging process

Architecture

Figure: Illustration of the fine-tuning and model merging process for CLIP-like models

Evaluation Highlights

Achieves >90% Attack Success Rate (ASR) against merged models, whereas prior methods fail (<20% ASR)
Demonstrates effectiveness across multiple merging algorithms including Task Arithmetic, Ties-Merging, RegMean, and AdaMerging
Successfully executes 'off-task' attacks where the target class belongs to a benign provider's task unknown to the adversary

Breakthrough Assessment

9/10

First dedicated attack on the Model Merging paradigm. Identifies a fundamental weakness in applying standard backdoors to MM (weight scaling) and proposes a theoretically grounded solution (interpolation loss) that works across algorithms.

⚙️ Technical Details

Problem Definition

Setting: Image classification using CLIP-like models, where multiple task-specific models are merged by a weighted sum of their weights or by task-vector arithmetic

Inputs: A set of task-specific models (one malicious, N benign) sharing a pre-trained initialization

Outputs: A single merged model M_merged intended to perform well on all component tasks

Pipeline Flow

Input Image -> Visual Encoder (Merged Weights) -> Image Embedding
Class Names -> Text Encoder (Frozen) -> Text Embeddings
Similarity Calculation -> Classification
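
The flow above can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the similarity measure and replaces both encoders with hypothetical, hand-written embedding vectors; the function names are not from the paper or from any CLIP library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def zero_shot_classify(image_embedding, text_embeddings):
    """Return the class name whose text embedding is most similar to the image embedding."""
    img = l2_normalize(image_embedding)
    sims = {}
    for name, emb in text_embeddings.items():
        txt = l2_normalize(emb)
        sims[name] = sum(a * b for a, b in zip(img, txt))
    best = max(sims, key=sims.get)
    return best, sims

# Hypothetical outputs of the frozen text encoder, one vector per class name.
text_embeddings = {
    "stop sign": [0.9, 0.1, 0.0],
    "speed limit": [0.1, 0.9, 0.0],
}
# Hypothetical output of the merged visual encoder for one input image.
image_embedding = [0.8, 0.3, 0.1]

label, sims = zero_shot_classify(image_embedding, text_embeddings)
```

Because only the visual encoder carries merged weights, a backdoor in this pipeline has to act by moving the image embedding toward the target class's text embedding.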

System Modules

Visual Encoder (Merged)

Processes input images into embeddings using merged weights

Model or implementation: CLIP-like Visual Encoder (e.g., ViT)

Text Encoder

Converts class names into embeddings (zero-shot capability)

Model or implementation: CLIP Text Encoder (Frozen)

Novel Architectural Elements

The attack exploits the linear interpolation property of the feature space in CLIP-like models to inject backdoors that persist under weight scaling

Modeling

Base Model: CLIP-like pre-trained models (e.g., CLIP, ALIGN, MetaCLIP)

Training Method: Adversarial Fine-tuning (Backdoor Injection)

Objective Functions:

Purpose: Ensure the merged model classifies triggered images as the target class regardless of the merging coefficient.

Formally: Feature-interpolation-based loss (exact formula not given in this summary)

Purpose: Maintain utility on the clean task.

Formally: Standard Cross-Entropy Loss
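
Since the formula for the interpolation loss is not stated here, the sketch below shows only one plausible reading and should be treated as an assumption throughout. It interpolates between the pre-trained encoder's features and the adversary's fine-tuned features with a coefficient p, then averages the cross-entropy toward the target class over several values of p. The function names, the dot-product logits, and the choice of coefficients are hypothetical. The code evaluates the loss value only; an actual attack would minimize it with an autodiff framework.

```python
import math

def cross_entropy(logits, target_index):
    """Numerically stable cross-entropy of a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

def interpolated_backdoor_loss(feat_pre, feat_ft, text_embeddings, target_index, coefficients):
    """Assumed form: mean cross-entropy toward the target class over interpolated features."""
    total = 0.0
    for p in coefficients:
        # Mix fine-tuned and pre-trained features; small p mimics a strongly scaled-down model.
        feat = [p * f + (1.0 - p) * g for f, g in zip(feat_ft, feat_pre)]
        logits = [sum(a * b for a, b in zip(feat, t)) for t in text_embeddings]
        total += cross_entropy(logits, target_index)
    return total / len(coefficients)

# Hypothetical features of one triggered image and hypothetical class text embeddings.
feat_pre = [1.0, 0.0]          # pre-trained encoder: looks like class 0
feat_ft = [0.0, 1.0]           # adversary's encoder: looks like target class 1
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]

loss_fixed = interpolated_backdoor_loss(feat_pre, feat_ft, text_embeddings, 1, [1.0])
loss_range = interpolated_backdoor_loss(feat_pre, feat_ft, text_embeddings, 1, [0.1, 0.3, 0.5, 1.0])
```

Under this reading, a backdoor that looks finished at p = 1 still incurs a higher loss once small coefficients are included, which is what pushes the optimization toward a trigger that survives scaling.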

Adaptation: Fine-tuning of Visual Encoder only (Text Encoder frozen)

Key Hyperparameters:

merging_coefficient_lambda: 0.3 (for Task Arithmetic and Ties)
merging_coefficient_lambda_SA: 1/N (Simple Average)
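
The 1/N setting follows from rewriting a plain average of weights in terms of task vectors: averaging N fine-tuned models gives the same result as adding the sum of their task vectors, scaled by 1/N, to the shared pre-trained weights. The check below is a small numeric sketch that assumes N is the number of models being averaged; the weight values are hypothetical.

```python
def task_vector(theta_ft, theta_pre):
    """Difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(theta_ft, theta_pre)]

theta_pre = [0.0, 1.0, 2.0]       # shared pre-trained initialization (hypothetical)
fine_tuned = [
    [0.3, 1.0, 2.6],
    [0.0, 1.9, 2.0],
    [-0.6, 1.0, 2.3],
]
n = len(fine_tuned)

# Simple Average: element-wise mean of the fine-tuned weights.
simple_average = [sum(col) / n for col in zip(*fine_tuned)]

# Same result written as pre-trained weights plus (1/N) times the summed task vectors.
taus = [task_vector(m, theta_pre) for m in fine_tuned]
via_task_vectors = [p + (1.0 / n) * sum(col) for p, col in zip(theta_pre, zip(*taus))]
```

Both coefficients listed above therefore shrink each contributor's task vector, which is the scaling the attack has to survive.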

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Backdoor Attacks (e.g., BadNets): Standard attacks optimize for a fixed model, while BadMerging optimizes for a range of scaling coefficients inherent to MM
vs. FL Backdoors: Unlike Federated Learning attacks, BadMerging assumes no knowledge of other participants' gradients or of how the final aggregation is performed

Limitations

Requires the adversary to contribute at least one model to the merge
Adversary must know the model architecture (to fine-tune from the same pre-trained base)
Effectiveness depends on the merging algorithm preserving the direction of the poisoned task vector

Reproducibility

Code: https://github.com/jzhang538/BadMerging

The paper relies on open-source CLIP models and standard datasets (CIFAR, etc.).

📊 Experiments & Results

Evaluation Setup

Merging a malicious task-specific model with benign models (e.g., CIFAR-100, GTSRB, SVHN) using CLIP-like backbones

Benchmarks:

CIFAR-100 (Image Classification)
GTSRB (Traffic Sign Recognition)
SVHN (Digit Classification)

Metrics:

Attack Success Rate (ASR)
Benign Accuracy (BA)

Statistical methodology: Not explicitly reported in the paper
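
Both metrics are simple percentages over model predictions. The sketch below assumes predictions are already available as lists of labels; the function names and sample data are hypothetical. Whether triggered images that already belong to the target class are excluded from ASR is an evaluation choice not stated in this summary.

```python
def attack_success_rate(predictions_on_triggered, target_class):
    """Percentage of triggered inputs that the model assigns to the adversary's target class."""
    hits = sum(1 for p in predictions_on_triggered if p == target_class)
    return 100.0 * hits / len(predictions_on_triggered)

def benign_accuracy(predictions_on_clean, true_labels):
    """Percentage of clean inputs that the model assigns to their true class."""
    correct = sum(1 for p, y in zip(predictions_on_clean, true_labels) if p == y)
    return 100.0 * correct / len(true_labels)

# Hypothetical predictions of a merged model.
triggered_preds = ["stop", "stop", "yield", "stop"]
clean_preds = ["stop", "yield", "yield", "speed", "stop"]
clean_labels = ["stop", "yield", "speed", "speed", "stop"]

asr = attack_success_rate(triggered_preds, "stop")
ba = benign_accuracy(clean_preds, clean_labels)
```

A successful attack in this sense needs a high ASR while leaving BA close to that of a clean merge.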

Key Results

Aggregate results reported in the introduction, comparing BadMerging to existing backdoor techniques in the context of Model Merging:

Benchmark	Metric	Baseline	This Paper	Δ
Merged Tasks (Aggregate)	Attack Success Rate (ASR, %)	<20	>90	>+70

Main Takeaways

Standard backdoor attacks fail (<20% ASR) because merging coefficients (often small, e.g., 0.3) scale down the trigger's effect
BadMerging successfully compromises merged models (>90% ASR) by making the backdoor robust to coefficient scaling
The attack is effective even when the adversary does not know the other tasks being merged (off-task attack) via the use of shadow classes

📚 Prerequisite Knowledge

Prerequisites

Model Merging (combining weights of fine-tuned models)
CLIP (Contrastive Language-Image Pre-Training)
Backdoor Attacks (Trojan attacks)

Key Terms

Model Merging (MM): A technique to combine multiple fine-tuned models into a single model by merging their weights (e.g., weighted sum) without additional training

Task Vector: The difference between a fine-tuned model's weights and the pre-trained initialization weights

CLIP: Contrastive Language-Image Pre-Training; a model trained on image-caption pairs consisting of a visual encoder and a text encoder

On-task attack: A backdoor attack where the target class belongs to the task provided by the adversary

Off-task attack: A backdoor attack where the target class belongs to a task provided by a benign contributor (unknown to the adversary)

Shadow Classes: Proxy classes selected by the adversary to simulate unknown target classes during the training of the backdoor

Attack Success Rate (ASR): The percentage of triggered images that the model classifies as the adversary's target class

Task Arithmetic (TA): A merging algorithm that scales task vectors by a hyperparameter λ (typically 0.3), sums them, and adds the result to the pre-trained weights
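
The Task Vector and Task Arithmetic definitions can be sketched together. This is a minimal illustration that assumes weights are stored as a dictionary of flat lists and that the scaled, summed task vectors are added back to the pre-trained weights; the function names and values are hypothetical and do not come from the paper's code.

```python
def task_vector(theta_ft, theta_pre):
    """Per-parameter difference between fine-tuned and pre-trained weights."""
    return {k: [f - p for f, p in zip(theta_ft[k], theta_pre[k])] for k in theta_pre}

def task_arithmetic_merge(theta_pre, fine_tuned_models, lam=0.3):
    """Add lam times each model's task vector to a copy of the pre-trained weights."""
    merged = {k: list(v) for k, v in theta_pre.items()}
    for theta_ft in fine_tuned_models:
        tau = task_vector(theta_ft, theta_pre)
        for k in merged:
            merged[k] = [m + lam * t for m, t in zip(merged[k], tau[k])]
    return merged

# Hypothetical one-layer models sharing the same pre-trained initialization.
theta_pre = {"w": [1.0, 1.0]}
model_a = {"w": [2.0, 1.0]}   # task vector [1.0, 0.0]
model_b = {"w": [1.0, 3.0]}   # task vector [0.0, 2.0]

merged = task_arithmetic_merge(theta_pre, [model_a, model_b], lam=0.3)
```

Each contributor's change to the weights enters the merged model at only 30% strength here, regardless of which model it came from.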