Combining domain-specific fine-tuning with spherical model merging unlocks synergistic capabilities in 7B-8B parameter models that exceed the performance of parent models, though this emergence is absent in smaller 1.7B models.
Core Problem
General-purpose LLMs lack specialized technical knowledge for fields like materials science, while standard fine-tuning often degrades general capabilities or fails to fully integrate new domain logic.
Why it matters:
Developing specialized models for complex engineering domains (e.g., biomateriomics) is computationally expensive if training from scratch
Standard fine-tuning strategies often result in catastrophic forgetting or linear trade-offs rather than synergistic performance gains
Understanding how model scale influences the success of merging strategies is critical for efficient deployment on edge devices vs. servers
Concrete Example: When a base model is fine-tuned on materials science papers (CPT), it may lose its instruction-following ability; conversely, an instruct model lacks specific knowledge such as biological material design concepts. Merging them attempts to retain both.
Key Novelty
Synergistic Domain Adaptation via SLERP Merging
Applies Spherical Linear Interpolation (SLERP) to merge a domain-adapted model (trained via CPT, SFT, and ORPO) with a general-purpose instruction-tuned model
Demonstrates that this merging is not merely additive but creates nonlinear 'emergent' capabilities where the merged model outperforms the average of its parents
Identifies a scaling threshold where 7B+ parameter models exhibit this synergy, while 1.7B parameter models do not
Architecture
Comparison of two training pipelines: (A) a linear sequence of CPT → SFT → DPO/ORPO, and (B) the proposed pipeline, which appends SLERP model merging as a final step.
Evaluation Highlights
Mistral-7B variants using SLERP merging achieve >20% relative improvement over the Mistral-7B-Instruct-v0.3 baseline on domain benchmarks
The best fine-tuned Mistral model achieves an absolute accuracy score of 0.81 on the overall domain benchmark using the integrated dataset
Llama-3.1-8B merged models show ~12% relative improvement over the Instruct baseline, with Instruct-CPT-SFT-ORPO-SLERP being the top strategy
Breakthrough Assessment
7/10
Provides strong empirical evidence for the non-linear benefits of SLERP merging in domain adaptation and identifies important scaling boundaries, though the method relies on existing algorithms (SLERP, ORPO).
⚙️ Technical Details
Problem Definition
Setting: Domain adaptation of LLMs for specialized scientific reasoning and knowledge retrieval
Inputs: Natural language queries regarding materials science, spider silk, and bio-inspired design
Outputs: Accurate, reasoned scientific explanations and design concepts
Pipeline Flow
Base Model Selection (Llama-3.1-8B or Mistral-7B)
Continued Pre-Training (CPT) on scientific corpus
Supervised Fine-Tuning (SFT) on Q&A pairs
Preference Optimization (ORPO or DPO)
Model Merging (SLERP) with original Instruct Model
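In practice, the final SLERP step (5) is commonly run with the open-source mergekit tool. A minimal config sketch is shown below; the domain-model name, layer ranges, and t value are illustrative assumptions, not taken from the paper:

```yaml
# Hypothetical mergekit SLERP config -- model names and values are illustrative
slices:
  - sources:
      - model: lamm-mit/mistral-7b-cpt-sft-orpo    # hypothetical domain-adapted checkpoint
        layer_range: [0, 32]
      - model: mistralai/Mistral-7B-Instruct-v0.3  # general instruct parent
        layer_range: [0, 32]
merge_method: slerp
base_model: mistralai/Mistral-7B-Instruct-v0.3
parameters:
  t: 0.5          # interpolation factor between the two parents
dtype: bfloat16
```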
System Modules
Domain Adapter (Training)
Imbue model with domain knowledge via CPT and SFT
Model or implementation: Llama-3.1-8B or Mistral-7B-v0.3
Merger
Combine fine-tuned weights with general instruct weights to recover general capabilities and unlock synergy
Model or implementation: SLERP (Algorithm)
Modeling
Base Model: Llama-3.1-8B, Mistral-7B-v0.3, and SmolLM-1.7B
CPT/SFT objective: causal language modeling loss (next-token prediction).
Alignment purpose: align the model with human preferences.
Formally: DPO or ORPO objective functions (optimizing the likelihood of preferred over rejected responses).
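For reference, the standard DPO loss (Rafailov et al.) and the ORPO odds-ratio objective (Hong et al.) take the forms below; these are the published objectives, not formulas reproduced from this paper:

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]

\mathcal{L}_{\mathrm{ORPO}}
  = \mathbb{E}\!\left[\mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}}\right],
\qquad
\mathcal{L}_{\mathrm{OR}}
  = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
```

Here y_w and y_l are the preferred and rejected responses; note that ORPO needs no separate reference model, which is why the paper can apply it directly after SFT.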
Training Data:
Corpus of 1,000 PDF papers (processed to text)
Extended dataset of 8,000 papers (varied quality)
Distilled Q&A pairs and reasoning chains
lamm-mit/magpie-ultra-v0.1 dataset
Key Hyperparameters:
cpt_epochs: 5 (Mistral-7B-Instruct performance peaks at this value)
slerp_interpolation_factor: Variable (visualized as t in Figure 3)
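The slerp_interpolation_factor t controls where the merged weights sit between the two parents. A minimal pure-Python sketch of SLERP on a single flattened weight vector (real merges apply this per layer or per tensor, often with a per-layer schedule for t):

```python
import math

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    A minimal sketch of the SLERP merge step: interpolate along the great
    circle between the two weight vectors rather than the straight line.
    """
    dot = sum(x * y for x, y in zip(w_a, w_b))
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b + eps)))
    omega = math.acos(cos_omega)          # angle between the two weight vectors
    if omega < eps:                       # near-parallel weights: fall back to LERP
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    so = math.sin(omega)
    return [
        (math.sin((1 - t) * omega) / so) * x + (math.sin(t * omega) / so) * y
        for x, y in zip(w_a, w_b)
    ]
```

At t = 0 the function returns the first parent, at t = 1 the second; intermediate t traces the spherical path between them.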
Comparison to Prior Work
vs. LoRA: LoRA limits knowledge incorporation; CPT+SLERP allows deeper parameter updates and synergy
vs. LERP: SLERP respects parameter space geometry (sphere vs Euclidean), preventing performance degradation during merging
vs. Standard SFT: This approach uses Model Merging as a final step to regain general capabilities lost during aggressive domain tuning
Limitations
Emergent capabilities from merging are not observed in smaller models (1.7B parameters), suggesting a scaling threshold.
Lower quality OCR data (from Nougat) in the extended dataset negatively impacted performance compared to the cleaner, smaller dataset.
Base models without instruction tuning (e.g., Mistral Base) show fluctuating performance during CPT compared to Instruct models.
Reproducibility
Model weights are referenced under the 'lamm-mit/...' namespace (likely Hugging Face). Training scripts/code availability is explicitly marked 'not provided' in the text, although the text mentions 'references to codes'. Datasets are described but not linked as a downloadable package.
📊 Experiments & Results
Evaluation Setup
Domain-specific QA and reasoning tasks in materials science
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark        | Metric   | Baseline                  | This Paper | Δ                         |
|------------------|----------|---------------------------|------------|---------------------------|
| Overall Accuracy | Accuracy | Not reported in the paper | 0.81       | Not reported in the paper |
| Overall Accuracy | Accuracy | 0.80                      | 0.81       | +0.01                     |
Experiment Figures
Scatter plot of Actual Performance vs Expected Performance (average of parents) for merged models.
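In this plot, "expected performance" is the linear average of the two parent models, and a positive gap between actual and expected indicates synergy. A tiny helper illustrating the metric (the example numbers below are hypothetical, not results from the paper):

```python
def synergy(merged_acc, parent_a_acc, parent_b_acc):
    """Synergy = actual merged accuracy minus the linear average of its parents.

    Positive values indicate emergent (super-additive) gains from merging;
    zero or negative values indicate merely additive or lossy merges.
    """
    expected = (parent_a_acc + parent_b_acc) / 2
    return merged_acc - expected

# Hypothetical example: merged model at 0.81 vs. parents at 0.80 and 0.70
gap = synergy(0.81, 0.80, 0.70)   # positive gap -> synergistic merge
```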
Main Takeaways
SLERP merging generates 'synergistic' capabilities where the merged model outperforms the linear average of its parents, particularly for 7B/8B models.
Mistral-7B models benefit most from the Instruct-CPT-ORPO-SLERP strategy, showing >20% relative improvement over the baseline.
SmolLM (1.7B) does not benefit from merging strategies in the same way, with CPT-SFT-DPO (unmerged) performing best, suggesting a minimum capacity required for synergistic merging.
Data quality (clean text vs noisy OCR) is more critical than quantity for CPT in this domain.
📚 Prerequisite Knowledge
Prerequisites
Understanding of LLM fine-tuning stages (Pre-training, SFT, Alignment)
Familiarity with vector interpolation methods
Basic knowledge of model scaling laws
Key Terms
CPT: Continued Pre-Training—training a base model on domain-specific raw text before instruction tuning
SFT: Supervised Fine-Tuning—training a model on labeled instruction-response pairs
DPO: Direct Preference Optimization—an alignment method optimizing the model based on preference pairs without a separate reward model
ORPO: Odds Ratio Preference Optimization—a monolithic preference alignment method that doesn't require a reference model
SLERP: Spherical Linear Interpolation—a method to merge model weights by interpolating along a spherical path to preserve geometric structure, rather than a straight line (LERP)
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique injecting low-rank matrices into linear layers
Biomateriomics: The study of biological materials and their application in engineering and design