Evaluation Setup
Groups of 8 fine-tuned models are merged, and the merged model is evaluated on each constituent model's task
Benchmarks:
- GLUE (Natural Language Understanding)
- Lots-of-LoRAs (Diverse text generation/understanding tasks)
Metrics:
- Merging Loss (percentage degradation relative to individual fine-tuned performance)
- Pearson Correlation Coefficient (between conflict metrics and merging loss)
- ROUGE-L (for Lots-of-LoRAs)
- Classification Accuracy (for GLUE)
- Statistical methodology: Pearson correlation analysis with p-value reporting, testing whether each candidate metric's relationship to merging collapse is statistically significant
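The metrics above can be sketched in a few lines. This is a minimal illustration (not the paper's code): merging loss as percentage degradation relative to the fine-tuned model, and a textbook Pearson correlation between a conflict metric and the observed losses. The toy numbers are invented for demonstration only.

```python
import math

def merging_loss(merged_score, finetuned_score):
    """Percentage degradation of the merged model relative to the
    individually fine-tuned model on the same task (negative = worse)."""
    return 100.0 * (merged_score - finetuned_score) / finetuned_score

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy illustration: a conflict metric that rises as merging loss worsens
# yields |r| near 1; an uninformative metric yields |r| near 0.
conflict_metric = [0.1, 0.4, 0.5, 0.9]
losses = [merging_loss(m, f) for m, f in
          [(80, 82), (70, 80), (60, 75), (40, 70)]]
print(round(pearson_r(conflict_metric, losses), 3))
```

In practice one would use `scipy.stats.pearsonr`, which also returns the p-value used for the significance tests reported below.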
Key Results
Correlation analysis demonstrates that widely used parameter conflict metrics do not predict merging collapse, whereas the proposed hidden-state metric does:

| Benchmark | Metric | Baseline (significance threshold) | This Paper | Δ |
|---|---|---|---|---|
| GLUE (Qwen2.5-3B) | P-value (Parameter Magnitude Change) | 0.05 | > 0.05 | Not applicable |
| GLUE (Qwen2.5-3B) | P-value (Parameter Sign Change) | 0.05 | > 0.05 | Not applicable |
| GLUE (Qwen2.5-3B) | P-value (Hidden-state Distance Similarity) | 0.05 | < 0.05 | Not applicable |

Magnitude of merging collapse across different task groups:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GLUE | Merging Loss, % (Best Case) | 0 | -32.8 | -32.8 |
Main Takeaways
- Merging collapse is universal: All 5 tested methods (including TIES and DARE) suffer severe degradation on incompatible task groups, with 2/3 of Lots-of-LoRAs groups losing >30% performance.
- Parameter conflicts are a red herring: Metrics based on weight sign/magnitude disagreements failed to correlate with actual merging performance, invalidating the premise of many existing merging algorithms.
- Representational compatibility is key: The geometry of hidden states (measured by MDS) reliably predicts collapse, supporting the rate-distortion theoretical framework.
- Task selection works: Replacing high-MDS tasks (incompatible) with lower-MDS ones significantly improves the performance of the merged model.