Evaluation Setup
Task: multiple-choice question answering on 8 standard benchmarks, using a leave-one-out setup (train on 7 benchmarks, evaluate on the held-out target).
Benchmarks:
- ARC-Easy (Grade-school science reasoning)
- OpenBookQA (Open-book QA)
- WinoGrande (Commonsense reasoning)
- PIQA (Physical reasoning)
- MathQA (Mathematical reasoning)
- HellaSwag (Commonsense NLI)
- SocialIQA (Social interaction understanding)
- CommonsenseQA (Commonsense QA)
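The leave-one-out protocol above can be sketched as follows. This is a minimal illustration of the split logic only; the function and variable names are assumptions, not taken from the paper's code.

```python
# Hypothetical sketch of the leave-one-out protocol: for each target
# benchmark, the remaining 7 form the source/training pool.
BENCHMARKS = [
    "ARC-Easy", "OpenBookQA", "WinoGrande", "PIQA",
    "MathQA", "HellaSwag", "SocialIQA", "CommonsenseQA",
]

def leave_one_out_splits(benchmarks):
    """Yield (train_pool, target) pairs: train on 7, evaluate on 1."""
    for target in benchmarks:
        train_pool = [b for b in benchmarks if b != target]
        yield train_pool, target

splits = list(leave_one_out_splits(BENCHMARKS))
# 8 splits in total, each with a 7-benchmark training pool
```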
Metrics:
- Accuracy
- Wall-clock training time
- Training steps to convergence
- Statistical methodology: Not explicitly reported in the paper
Key Results
Mashup Learning consistently improves final accuracy compared to training from scratch across different model sizes and adaptation methods.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average across 8 datasets | Accuracy (LoRA) | 58.4 | 60.2 | +1.8 |
| Average across 8 datasets | Accuracy (LoRA) | Not reported in the paper | Not reported in the paper | +0.7 |
| Average across 8 datasets | Accuracy (Full FT) | Not reported in the paper | Not reported in the paper | +1.9 |

Comparisons using the 'Lots-of-LoRAs' collection show significant gains over the Text-to-LoRA baseline.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average across 6 datasets | Accuracy | Not reported in the paper | Not reported in the paper | +5.1 |

Convergence analysis demonstrates that Mashup Learning reaches baseline accuracy significantly faster.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average across tasks | Steps to match scratch accuracy (%) | 100 | 55.5 | -44.5 |
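The "steps to match scratch accuracy" metric can be computed as sketched below: find the first step at which each method's learning curve reaches the scratch run's final accuracy, then take the ratio. The curves here are made-up illustrative numbers, not the paper's data.

```python
def steps_to_match(curve, target_acc):
    """Return the first training step at which accuracy reaches target_acc.

    curve: list of (step, accuracy) pairs, assumed sorted by step.
    """
    for step, acc in curve:
        if acc >= target_acc:
            return step
    return None  # target accuracy never reached

# Illustrative learning curves (invented numbers):
scratch = [(100, 0.40), (500, 0.52), (1000, 0.58), (2000, 0.584)]
mashup  = [(100, 0.50), (500, 0.575), (1000, 0.59), (1100, 0.60)]

final_scratch = scratch[-1][1]  # scratch run's final accuracy
relative = steps_to_match(mashup, final_scratch) / steps_to_match(scratch, final_scratch)
# relative < 1 means the initialized run matches scratch in fewer steps
```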
Main Takeaways
- Consistent improvements: Mashup Learning improves over random initialization across 3 model families (Gemma-3 1B/4B, Gemma-2 2B) and 2 training regimes (LoRA, Full FT).
- Convergence speedup: Matches the accuracy of training from scratch in roughly 41-59% of the steps (55.5% on average), translating to real wall-clock savings even after accounting for selection overhead.
- Broad applicability: Works for both full finetuning and parameter-efficient methods (LoRA), and scales to larger checkpoint libraries (Lots-of-LoRAs).
- Data efficiency in selection: A small proxy set of 256 samples is sufficient to identify high-quality source checkpoints.
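The proxy-set selection idea from the last takeaway can be sketched as picking the source checkpoint with the lowest loss on a small held-out sample. Everything here is an assumption for illustration: `evaluate_loss` stands in for running each candidate model on the proxy samples, and the checkpoint names and toy losses are invented.

```python
import random

PROXY_SIZE = 256  # small proxy set, as in the takeaway above

def select_checkpoint(checkpoints, proxy_set, evaluate_loss):
    """Pick the source checkpoint with the lowest loss on the proxy set."""
    return min(checkpoints, key=lambda ckpt: evaluate_loss(ckpt, proxy_set))

# Toy stand-in: pretend each checkpoint has a fixed per-sample loss.
def toy_loss(ckpt, proxy_set):
    base = {"ckpt_a": 2.1, "ckpt_b": 1.4, "ckpt_c": 1.8}[ckpt]
    return sum(base for _ in proxy_set) / len(proxy_set)

proxy = random.sample(range(10_000), PROXY_SIZE)  # 256 held-out sample ids
best = select_checkpoint(["ckpt_a", "ckpt_b", "ckpt_c"], proxy, toy_loss)
# best is the checkpoint with the lowest proxy loss ("ckpt_b" in this toy)
```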