Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

📝 Paper Summary

Model Merging, Parameter-Efficient Fine-Tuning (PEFT), Network Pruning

Language models fine-tuned on different tasks can be effectively merged into a single capable model by randomly dropping up to 99% of their delta parameters and rescaling the rest.

Core Problem

Merging multiple fine-tuned models often leads to parameter interference where one task's weights degrade another's performance, and standard SFT models contain massive parameter redundancy.

Why it matters:

Training a single massive multi-task model is computationally expensive and inflexible compared to merging specialized expert models
Current model merging techniques struggle with interference when combining models with overlapping parameter updates
Understanding the redundancy in SFT updates reveals that fine-tuning primarily exposes existing capabilities rather than learning new complex features

Concrete Example: When merging a Math-tuned Llama model and a Code-tuned Llama model, simply averaging their weights often degrades performance on both tasks because their parameter updates conflict. DARE mitigates this by making each model's update sparse (mostly zeros), so the surviving updates rarely land on the same parameters.
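A rough way to see why sparsity reduces collisions: if each model's deltas are dropped independently with rate p, a given parameter position survives in both models with probability (1-p)^2, which is 1% at p = 0.9. The toy simulation below checks this under that independence assumption. The names keep_mask, mask_math, and mask_code are illustrative and do not come from the paper.

```python
import random


def keep_mask(n, p, rng):
    """Return n booleans; True means the delta at that position survives the drop."""
    return [rng.random() >= p for _ in range(n)]


rng = random.Random(0)
n, p = 200_000, 0.9

# Independent drop masks for two hypothetical fine-tuned models.
mask_math = keep_mask(n, p, rng)
mask_code = keep_mask(n, p, rng)

# Fraction of positions where both models still carry a non-zero delta.
overlap = sum(a and b for a, b in zip(mask_math, mask_code)) / n
expected = (1 - p) ** 2

print(f"observed overlap: {overlap:.4f}, expected: {expected:.4f}")
```

Overlap is therefore rare rather than impossible, and it shrinks quadratically as the drop rate approaches 1.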

Key Novelty

DARE (Drop And REscale)

Discovers that SFT delta parameters (fine-tuned minus pre-trained weights) are extremely redundant and small in magnitude (absolute values generally below 0.002)
Randomly drops a fraction p of the delta parameters (setting them to zero) and rescales the remaining ones by 1/(1-p), so that each delta keeps its original value in expectation and the model's feature embeddings stay approximately unchanged
Uses this sparsification as a pre-processing step for model merging, minimizing collisions between different task-specific updates
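The drop-and-rescale step described above can be sketched in a few lines. This is a minimal illustration on a flat Python list standing in for a weight tensor, not the authors' implementation; the function name dare and the toy delta values are assumptions made for the example.

```python
import random


def dare(delta, p, seed=0):
    """Zero each delta with probability p, then rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("drop rate p must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else d * scale for d in delta]


# Toy delta vector: every entry is 0.001 (an assumed value inside the SFT range).
delta = [0.001] * 100_000
sparse = dare(delta, p=0.9)

kept = sum(1 for d in sparse if d != 0.0)
mean_before = sum(delta) / len(delta)
mean_after = sum(sparse) / len(sparse)

print(f"kept {kept / len(delta):.1%} of deltas")
print(f"mean before: {mean_before:.6f}, mean after: {mean_after:.6f}")
```

Roughly 10% of the entries survive, each ten times larger, so the average delta stays close to its original value even though most entries are now zero.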

Architecture

Conceptual workflow of DARE and its application in model merging

Evaluation Highlights

+19.57 points on MBPP (code generation, Pass@1 from 40.20 to 59.77) when merging LM, Math, and Code models, compared to the Code model alone
Maintains performance while dropping 99% of delta parameters for 70B models, showing extreme redundancy in SFT updates
Achieved 1st rank on the Open LLM Leaderboard (7B parameter category) by merging diverse SFT models using DARE

Breakthrough Assessment

8/10

Significant empirical discovery regarding SFT redundancy. The method is incredibly simple (random drop + rescale) yet highly effective for merging, enabling 'free' multi-task capabilities without training.

⚙️ Technical Details

Problem Definition

Setting: Merging K homologous SFT models (fine-tuned from the same backbone) into a single model that retains the capabilities of all source models

Inputs: A pre-trained base model θ_PRE and K fine-tuned models {θ_SFT^1, ..., θ_SFT^K}

Outputs: A merged model parameter set θ_M

Pipeline Flow

Delta Extraction: Calculate differences between SFT models and base model
DARE (Drop): Randomly mask delta parameters with probability p
DARE (Rescale): Scale remaining deltas by 1/(1-p)
Merge: Fuse sparsified deltas using addition or Task Arithmetic
Reconstruction: Add fused deltas back to base model
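The five steps above can be strung together as follows. This is a hedged sketch that uses flat lists in place of model tensors and a Task Arithmetic style sum for the merge step; merge_with_dare, lam (the scaling coefficient λ), and the toy weights are hypothetical names and values, not the API of the MergeLM repository.

```python
import random


def dare(delta, p, rng):
    """Drop each delta with probability p and rescale the rest by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else d * scale for d in delta]


def merge_with_dare(theta_pre, theta_sfts, p=0.9, lam=1.0, seed=0):
    """Merge homologous SFT models into one parameter list."""
    rng = random.Random(seed)
    fused = [0.0] * len(theta_pre)
    for theta_sft in theta_sfts:
        # 1. Delta extraction: task-specific update relative to the base model.
        delta = [s - b for s, b in zip(theta_sft, theta_pre)]
        # 2-3. DARE: random drop followed by rescaling.
        sparse = dare(delta, p, rng)
        # 4. Merge: accumulate the scaled sparse deltas.
        fused = [f + lam * d for f, d in zip(fused, sparse)]
    # 5. Reconstruction: add the fused deltas back to the base model.
    return [b + f for b, f in zip(theta_pre, fused)]


theta_pre = [1.0, 2.0]
theta_math = [1.5, 2.0]   # toy "math" model: only the first weight moved
theta_code = [1.0, 2.25]  # toy "code" model: only the second weight moved

# With p = 0 nothing is dropped, so the result is plain Task Arithmetic.
merged_dense = merge_with_dare(theta_pre, [theta_math, theta_code], p=0.0)
print(merged_dense)
```

Setting p to 0 disables the sparsifier, which makes it easy to compare a DARE merge against the dense baseline on the same checkpoints.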

System Modules

Delta Calculator (Parameter Processing)

Isolate task-specific updates

Model or implementation: Arithmetic operation

DARE Sparsifier (Parameter Processing)

Eliminate redundant parameters and rescale to preserve embedding magnitude

Model or implementation: Stochastic transformation

Merger

Combine task vectors from multiple models

Model or implementation: Task Arithmetic / Average Merging
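For reference, the two merging rules named here can be written in terms of delta parameters. The notation below (δ^k for the k-th model's delta, λ for the Task Arithmetic scaling coefficient) is chosen for this summary and follows the standard definitions of these methods; when DARE is used, each δ^k is replaced by its dropped and rescaled version before the sum.

```latex
\begin{align*}
\delta^{k} &= \theta_{\mathrm{SFT}}^{k} - \theta_{\mathrm{PRE}}, \qquad k = 1, \dots, K \\
\text{Average Merging:}\quad \theta_{M} &= \frac{1}{K} \sum_{k=1}^{K} \theta_{\mathrm{SFT}}^{k}
  = \theta_{\mathrm{PRE}} + \frac{1}{K} \sum_{k=1}^{K} \delta^{k} \\
\text{Task Arithmetic:}\quad \theta_{M} &= \theta_{\mathrm{PRE}} + \lambda \sum_{k=1}^{K} \delta^{k}
\end{align*}
```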

Novel Architectural Elements

Application of random dropout and rescaling directly to model weights (delta parameters) post-training, rather than to activations during training
Plug-and-play sparsification module for existing model merging pipelines

Modeling

Base Model: Llama-2 (7B, 13B, 70B), Code Llama, BERT, RoBERTa

Training Method: None; post-training parameter manipulation (Model Merging)

Adaptation: None (Merging existing SFT checkpoints)

Trainable Parameters: 0 (No training involved)

Key Hyperparameters:

drop_rate_p: 0.9 (typical), up to 0.99 for 70B models
rescale_factor: 1 / (1 - p)
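The choice of 1 / (1 - p) as the rescale factor follows from a one-line expectation argument. The derivation below is a sketch in notation assumed for this summary: δ_i is a single delta parameter, m_i is its drop indicator (1 means dropped), and the hatted δ_i is the value after drop and rescale.

```latex
\begin{align*}
m_i &\sim \mathrm{Bernoulli}(p), \qquad
\hat{\delta}_i = \frac{(1 - m_i)\,\delta_i}{1 - p} \\
\mathbb{E}[\hat{\delta}_i] &= (1 - p) \cdot \frac{\delta_i}{1 - p} + p \cdot 0 = \delta_i \\
\mathrm{Var}[\hat{\delta}_i] &= \frac{\delta_i^{2}}{1 - p} - \delta_i^{2} = \frac{p}{1 - p}\,\delta_i^{2}
\end{align*}
```

At p = 0.9 the survivors are multiplied by 10 and the per-parameter variance is 9 times the squared delta; at p = 0.99 the factor is 100 and the variance multiplier is 99. The estimate stays unbiased at any drop rate, but it gets noisier as p grows, which is tolerable only because the deltas themselves are small.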

Compute: CPU-only (no GPU required for merging)

Comparison to Prior Work

vs. Task Arithmetic: DARE sparsifies vectors *before* merging to reduce interference, whereas Task Arithmetic merges dense vectors
vs. TIES-Merging: DARE uses random dropping which is simpler and surprisingly effective, compared to TIES' magnitude-based selection and sign resolution
vs. Magnitude Pruning: DARE operates on delta parameters, not full weights, and requires rescaling to approximate original embeddings
vs. RegMean: DARE does not require closed-form regression or data covariance statistics [not cited in paper]

Limitations

Not applicable to models undergoing continuous pre-training where delta parameters are large (>0.03)
Cannot merge models with different architectures or initializations (must be homologous)
Performance drops catastrophically if applied to fine-tuned parameters directly instead of delta parameters
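The last limitation has a simple parameter-space intuition: the noise introduced by drop-and-rescale grows with the magnitude of the values it is applied to. The toy experiment below compares the two cases. All numbers are assumptions chosen for illustration (base weights drawn with standard deviation 0.05, deltas uniform within ±0.002), and it measures only the distance from the fine-tuned weights, not task performance.

```python
import math
import random


def drop_and_rescale(values, p, rng):
    """Zero each value with probability p and rescale the rest by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


def l2_distance(a, b):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
n, p = 50_000, 0.9

theta_pre = [rng.gauss(0.0, 0.05) for _ in range(n)]    # toy base weights
delta = [rng.uniform(-0.002, 0.002) for _ in range(n)]  # toy SFT deltas
theta_sft = [b + d for b, d in zip(theta_pre, delta)]

# (a) Intended use: sparsify the deltas, then add them back to the base.
on_delta = [b + d for b, d in zip(theta_pre, drop_and_rescale(delta, p, rng))]

# (b) Misuse: sparsify the fine-tuned weights themselves.
on_weights = drop_and_rescale(theta_sft, p, rng)

err_delta = l2_distance(on_delta, theta_sft)
err_weights = l2_distance(on_weights, theta_sft)

print(f"distance from SFT weights, DARE on deltas:  {err_delta:.3f}")
print(f"distance from SFT weights, DARE on weights: {err_weights:.3f}")
```

Under these assumptions the second distance is more than an order of magnitude larger than the first, which is consistent with the same reasoning behind the continuous pre-training limitation: larger values mean larger perturbations.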

Reproducibility

Code: https://github.com/yule-BUAA/MergeLM

The method is deterministic if the random seed for the drop mask is fixed. It relies on access to homologous SFT checkpoints.

📊 Experiments & Results

Evaluation Setup

Merging multiple domain-specific SFT models (WizardLM, WizardMath, Code Llama) and evaluating on their respective benchmarks

Benchmarks:

AlpacaEval (Instruction Following)
GSM8K (Mathematical Reasoning)
MBPP (Code Generation)
GLUE (Natural Language Understanding, encoder models)

Metrics:

Win Rate (AlpacaEval)
Accuracy (GSM8K, GLUE)
Pass@1 (MBPP)
Statistical methodology: Not explicitly reported in the paper

Key Results

Merging three diverse models (LM, Math, Code) using DARE + Task Arithmetic improves performance over single-specialist baselines, demonstrating the absorption of capabilities:
Benchmark	Metric	Baseline	This Paper	Δ
AlpacaEval	Win Rate	88.33	91.43	+3.10
GSM8K	Accuracy	54.06	57.24	+3.18
MBPP	Pass@1	40.20	59.77	+19.57

Ablation study on parameter sparsification shows tolerance for high drop rates:
Benchmark	Metric	Baseline	This Paper	Δ
AlpacaEval	Win Rate	92.89	93.12	+0.23

Experiment Figures

Left (a): Performance vs Drop Rate for different model sizes. Right (b): Radar chart of capabilities for single vs merged models.

Main Takeaways

SFT delta parameters are highly redundant: 90-99% of them can be removed without hurting performance, provided the remaining ones are rescaled properly
DARE acts as an effective plug-in for model merging, significantly outperforming standard merging methods by reducing parameter conflict
Larger models (e.g., 70B) tolerate higher drop rates (up to 99%) than smaller models (e.g., 7B, which tolerate drop rates of roughly 10-90%)
The technique fails for continuous pre-training, where delta parameters are large (~0.03 vs ~0.002 for SFT); the small size of SFT deltas suggests that SFT mainly unlocks existing latent abilities

📚 Prerequisite Knowledge

Prerequisites

Understanding of Supervised Fine-Tuning (SFT)
Vector addition and parameter-wise operations
Basic probability (expectations)

Key Terms

Delta parameters: The difference between the parameters of a fine-tuned model and its pre-trained base model (θ_SFT - θ_PRE)

Homologous models: Models that share the same pre-trained backbone architecture and initialization (e.g., multiple Llama-2-7B models tuned on different data)

SFT: Supervised Fine-Tuning—adapting a pre-trained model to a specific task using labeled data

Task Arithmetic: A model merging technique that adds task-specific vectors (delta parameters) to a base model, often scaled by a coefficient λ

Bernoulli distribution: A probability distribution taking value 1 with probability p and 0 with probability 1-p, used here for the random drop mask (p is the drop rate, so a 1 marks a dropped parameter)

AlpacaEval: A benchmark for evaluating instruction-following capabilities of language models

GSM8K: Grade School Math 8K, a benchmark dataset of high-quality grade school math word problems

MBPP: Mostly Basic Python Programming—a benchmark for evaluating code generation