Evaluation Setup
Groups of 8 fine-tuned models are merged, and the merged model is evaluated on each constituent model's task
Benchmarks:
- GLUE (Natural Language Understanding)
- Lots-of-LoRAs (Diverse text generation/understanding tasks)
Metrics:
- Merging Loss (percentage degradation relative to individual fine-tuned performance)
- Pearson Correlation Coefficient (between conflict metrics and merging loss)
- ROUGE-L (for Lots-of-LoRAs)
- Classification Accuracy (for GLUE)
- Statistical methodology: Pearson correlation analysis with p-value reporting, testing whether each candidate metric's relationship to merging collapse is statistically significant
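The metrics above can be sketched in a few lines. This is a minimal illustration (not the paper's code): merging loss as percentage degradation relative to the fine-tuned model, and a textbook Pearson correlation between a conflict metric and the observed losses. The toy numbers are invented for demonstration only.

```python
import math

def merging_loss(merged_score, finetuned_score):
    """Percentage degradation of the merged model relative to the
    individually fine-tuned model on the same task (negative = worse)."""
    return 100.0 * (merged_score - finetuned_score) / finetuned_score

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy illustration: a conflict metric that rises as merging loss worsens
# yields |r| near 1; an uninformative metric yields |r| near 0.
conflict_metric = [0.1, 0.4, 0.5, 0.9]
losses = [merging_loss(m, f) for m, f in
          [(80, 82), (70, 80), (60, 75), (40, 70)]]
print(round(pearson_r(conflict_metric, losses), 3))
```

In practice one would use `scipy.stats.pearsonr`, which also returns the p-value used for the significance tests reported below.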
Key Results
Correlation analysis demonstrates that widely used parameter conflict metrics do not predict merging collapse, whereas the proposed hidden-state metric does:

| Benchmark | Metric | Baseline (significance threshold) | This Paper | Δ |
|---|---|---|---|---|
| GLUE (Qwen2.5-3B) | P-value (Parameter Magnitude Change) | 0.05 | > 0.05 | Not applicable |
| GLUE (Qwen2.5-3B) | P-value (Parameter Sign Change) | 0.05 | > 0.05 | Not applicable |
| GLUE (Qwen2.5-3B) | P-value (Hidden-state Distance Similarity) | 0.05 | < 0.05 | Not applicable |

Magnitude of merging collapse across different task groups:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GLUE | Merging Loss, % (Best Case) | 0 | -32.8 | -32.8 |
Main Takeaways
- Merging collapse is universal: All 5 tested methods (including TIES and DARE) suffer severe degradation on incompatible task groups, with 2/3 of Lots-of-LoRAs groups losing >30% performance.
- Parameter conflicts are a red herring: Metrics based on weight sign/magnitude disagreements failed to correlate with actual merging performance, invalidating the premise of many existing merging algorithms.
- Representational compatibility is key: The geometry of hidden states (measured by MDS) reliably predicts collapse, supporting the rate-distortion theoretical framework.
- Task selection works: Replacing high-MDS tasks (incompatible) with lower-MDS ones significantly improves the performance of the merged model.