Combining domain-specific fine-tuning with spherical model merging unlocks synergistic capabilities in 7B-8B parameter models that exceed the performance of parent models, though this emergence is absent in smaller 1.7B models.
Core Problem
General-purpose LLMs lack specialized technical knowledge for fields like materials science, while standard fine-tuning often degrades general capabilities or fails to fully integrate new domain logic.
Why it matters:
Developing specialized models for complex engineering domains (e.g., biomateriomics) is computationally expensive if training from scratch
Standard fine-tuning strategies often result in catastrophic forgetting or linear trade-offs rather than synergistic performance gains
Understanding how model scale influences the success of merging strategies is critical for efficient deployment on edge devices vs. servers
Concrete Example: When a base model is fine-tuned on materials science papers (CPT), it may lose its instruction-following ability; conversely, an instruct model lacks specific knowledge such as biological material design concepts. Merging them attempts to retain both.
Key Novelty
Synergistic Domain Adaptation via SLERP Merging
Applies Spherical Linear Interpolation (SLERP) to merge a domain-adapted model (trained via CPT, SFT, and ORPO) with a general-purpose instruction-tuned model
Demonstrates that this merging is not merely additive but creates nonlinear 'emergent' capabilities where the merged model outperforms the average of its parents
Identifies a scaling threshold where 7B+ parameter models exhibit this synergy, while 1.7B parameter models do not
Architecture
Comparison of two training pipelines: (A) a linear sequence of CPT → SFT → DPO/ORPO, and (B) the proposed pipeline, which appends SLERP model merging as a final step.
Evaluation Highlights
Mistral-7B variants using SLERP merging achieve >20% relative improvement over the Mistral-7B-Instruct-v0.3 baseline on domain benchmarks
The best fine-tuned Mistral model achieves an absolute accuracy score of 0.81 on the overall domain benchmark using the integrated dataset
Llama-3.1-8B merged models show ~12% relative improvement over the Instruct baseline, with Instruct-CPT-SFT-ORPO-SLERP being the top strategy
Breakthrough Assessment
7/10
Provides strong empirical evidence for the non-linear benefits of SLERP merging in domain adaptation and identifies important scaling boundaries, though the method relies on existing algorithms (SLERP, ORPO).
⚙️ Technical Details
Problem Definition
Setting: Domain adaptation of LLMs for specialized scientific reasoning and knowledge retrieval
Inputs: Natural language queries regarding materials science, spider silk, and bio-inspired design
Outputs: Accurate, reasoned scientific explanations and design concepts
Pipeline Flow
Base Model Selection (Llama-3.1-8B or Mistral-7B)
Continued Pre-Training (CPT) on scientific corpus
Supervised Fine-Tuning (SFT) on Q&A pairs
Preference Optimization (ORPO or DPO)
Model Merging (SLERP) with original Instruct Model
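In practice, the final SLERP step (5) is commonly run with the open-source mergekit tool. A minimal config sketch is shown below; the domain-model name, layer ranges, and t value are illustrative assumptions, not taken from the paper:

```yaml
# Hypothetical mergekit SLERP config -- model names and values are illustrative
slices:
  - sources:
      - model: lamm-mit/mistral-7b-cpt-sft-orpo    # hypothetical domain-adapted checkpoint
        layer_range: [0, 32]
      - model: mistralai/Mistral-7B-Instruct-v0.3  # general instruct parent
        layer_range: [0, 32]
merge_method: slerp
base_model: mistralai/Mistral-7B-Instruct-v0.3
parameters:
  t: 0.5          # interpolation factor between the two parents
dtype: bfloat16
```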
System Modules
Domain Adapter (Training)
Imbue model with domain knowledge via CPT and SFT
Model or implementation: Llama-3.1-8B or Mistral-7B-v0.3
Merger
Combine fine-tuned weights with general instruct weights to recover general capabilities and unlock synergy
Model or implementation: SLERP (Algorithm)
Modeling
Base Model: Llama-3.1-8B, Mistral-7B-v0.3, and SmolLM-1.7B
CPT/SFT objective: causal language modeling loss (next-token prediction).
Alignment purpose: align the model with human preferences.
Formally: DPO or ORPO objective functions (optimizing the likelihood of preferred over rejected responses).
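For reference, the standard DPO loss (Rafailov et al.) and the ORPO odds-ratio objective (Hong et al.) take the forms below; these are the published objectives, not formulas reproduced from this paper:

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]

\mathcal{L}_{\mathrm{ORPO}}
  = \mathbb{E}\!\left[\mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}}\right],
\qquad
\mathcal{L}_{\mathrm{OR}}
  = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
```

Here y_w and y_l are the preferred and rejected responses; note that ORPO needs no separate reference model, which is why the paper can apply it directly after SFT.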
Training Data:
Corpus of 1,000 PDF papers (processed to text)
Extended dataset of 8,000 papers (varied quality)
Distilled Q&A pairs and reasoning chains
lamm-mit/magpie-ultra-v0.1 dataset
Key Hyperparameters:
cpt_epochs: 5 (Mistral-7B-Instruct performance peaks at this value)
slerp_interpolation_factor: Variable (visualized as t in Figure 3)
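The slerp_interpolation_factor t controls where the merged weights sit between the two parents. A minimal pure-Python sketch of SLERP on a single flattened weight vector (real merges apply this per layer or per tensor, often with a per-layer schedule for t):

```python
import math

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    A minimal sketch of the SLERP merge step: interpolate along the great
    circle between the two weight vectors rather than the straight line.
    """
    dot = sum(x * y for x, y in zip(w_a, w_b))
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b + eps)))
    omega = math.acos(cos_omega)          # angle between the two weight vectors
    if omega < eps:                       # near-parallel weights: fall back to LERP
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    so = math.sin(omega)
    return [
        (math.sin((1 - t) * omega) / so) * x + (math.sin(t * omega) / so) * y
        for x, y in zip(w_a, w_b)
    ]
```

At t = 0 the function returns the first parent, at t = 1 the second; intermediate t traces the spherical path between them.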
Comparison to Prior Work
vs. LoRA: LoRA limits knowledge incorporation; CPT+SLERP allows deeper parameter updates and synergy
vs. LERP: SLERP respects parameter space geometry (sphere vs Euclidean), preventing performance degradation during merging
vs. Standard SFT: This approach uses Model Merging as a final step to regain general capabilities lost during aggressive domain tuning
Limitations
Emergent capabilities from merging are not observed in smaller models (1.7B parameters), suggesting a scaling threshold.
Lower quality OCR data (from Nougat) in the extended dataset negatively impacted performance compared to the cleaner, smaller dataset.
Base models without instruction tuning (e.g., Mistral Base) show fluctuating performance during CPT compared to Instruct models.
Reproducibility
Model weights are referenced under the 'lamm-mit/...' namespace (likely Hugging Face). Training scripts/code availability is explicitly marked 'not provided' in the text, although the text mentions 'references to codes'. Datasets are described but not linked as a downloadable package.
📊 Experiments & Results
Evaluation Setup
Domain-specific QA and reasoning tasks in materials science
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark        | Metric   | Baseline                  | This Paper | Δ                         |
|------------------|----------|---------------------------|------------|---------------------------|
| Overall Accuracy | Accuracy | Not reported in the paper | 0.81       | Not reported in the paper |
| Overall Accuracy | Accuracy | 0.80                      | 0.81       | +0.01                     |
Experiment Figures
Scatter plot of Actual Performance vs Expected Performance (average of parents) for merged models.
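In this plot, "expected performance" is the linear average of the two parent models, and a positive gap between actual and expected indicates synergy. A tiny helper illustrating the metric (the example numbers below are hypothetical, not results from the paper):

```python
def synergy(merged_acc, parent_a_acc, parent_b_acc):
    """Synergy = actual merged accuracy minus the linear average of its parents.

    Positive values indicate emergent (super-additive) gains from merging;
    zero or negative values indicate merely additive or lossy merges.
    """
    expected = (parent_a_acc + parent_b_acc) / 2
    return merged_acc - expected

# Hypothetical example: merged model at 0.81 vs. parents at 0.80 and 0.70
gap = synergy(0.81, 0.80, 0.70)   # positive gap -> synergistic merge
```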
Main Takeaways
SLERP merging generates 'synergistic' capabilities where the merged model outperforms the linear average of its parents, particularly for 7B/8B models.
Mistral-7B models benefit most from the Instruct-CPT-ORPO-SLERP strategy, showing >20% relative improvement over the baseline.
SmolLM (1.7B) does not benefit from merging strategies in the same way, with CPT-SFT-DPO (unmerged) performing best, suggesting a minimum capacity required for synergistic merging.
Data quality (clean text vs noisy OCR) is more critical than quantity for CPT in this domain.
📚 Prerequisite Knowledge
Prerequisites
Understanding of LLM fine-tuning stages (Pre-training, SFT, Alignment)
Familiarity with vector interpolation methods
Basic knowledge of model scaling laws
Key Terms
CPT: Continued Pre-Training—training a base model on domain-specific raw text before instruction tuning
SFT: Supervised Fine-Tuning—training a model on labeled instruction-response pairs
DPO: Direct Preference Optimization—an alignment method optimizing the model based on preference pairs without a separate reward model
ORPO: Odds Ratio Preference Optimization—a monolithic preference alignment method that doesn't require a reference model
SLERP: Spherical Linear Interpolation—a method to merge model weights by interpolating along a spherical path to preserve geometric structure, rather than a straight line (LERP)
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique injecting low-rank matrices into linear layers
Biomateriomics: The study of biological materials and their application in engineering and design