CPT: Continual Pretraining—training a pre-trained model on a new domain-specific corpus
SFT: Supervised Fine-Tuning—training a model on instruction-response pairs
Catastrophic Forgetting: The tendency of neural networks to lose previously learned knowledge (e.g., general math) when trained on new data (e.g., finance)
SLERP: Spherical Linear Interpolation—a method to blend two sets of weights by following the shortest arc along the surface of a high-dimensional sphere, preserving the weights' magnitude (norm) better than simple linear averaging
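As a minimal sketch (not the paper's implementation), SLERP over two flattened weight vectors can be written as follows; the fallback to linear interpolation for nearly parallel vectors is a standard numerical guard:

```python
import numpy as np

def slerp(w0, w1, t):
    """Spherical linear interpolation between two flattened weight vectors.

    Follows the great-circle arc from w0 (t=0) to w1 (t=1), which keeps the
    interpolated norm closer to the endpoints' norms than the straight-line
    average (1 - t) * w0 + t * w1 would.
    """
    w0 = np.asarray(w0, dtype=float)
    w1 = np.asarray(w1, dtype=float)
    # Angle between the two weight vectors.
    cos_theta = np.dot(w0, w1) / (np.linalg.norm(w0) * np.linalg.norm(w1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if theta < 1e-7:
        # Nearly parallel: the arc degenerates, so fall back to lerp.
        return (1 - t) * w0 + t * w1
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * w0 + (np.sin(t * theta) / s) * w1
```

For example, interpolating halfway between two orthogonal unit vectors yields another unit vector, whereas simple averaging would shrink the norm to about 0.71.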
SWCI: SNR-Weighted Change Intensity—a metric proposed in this paper that measures how much a parameter changed, weighted by its signal-to-noise ratio (importance)
SVDR: Singular Value Drop Ratio—a metric proposed in this paper measuring structural changes in a layer's information capacity via its singular values
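Purely as an illustration of the two metrics' intent (the paper's exact formulas may differ), SWCI can be read as a per-parameter change magnitude weighted by an importance score, and SVDR as the relative drop in a layer's singular-value mass after training:

```python
import numpy as np

def swci(w_before, w_after, snr):
    """Illustrative SWCI: mean |change| per parameter, weighted element-wise
    by an SNR (importance) score. Assumes `snr` has the same shape as the
    weight matrices; this is a sketch, not the paper's exact definition."""
    return float(np.mean(snr * np.abs(w_after - w_before)))

def svdr(w_before, w_after):
    """Illustrative SVDR: fractional drop in total singular-value mass of a
    layer after training. Positive values indicate reduced information
    capacity; again, a sketch of the idea rather than the paper's formula."""
    s_before = np.linalg.svd(w_before, compute_uv=False)
    s_after = np.linalg.svd(w_after, compute_uv=False)
    return float((s_before.sum() - s_after.sum()) / s_before.sum())
```

Under these toy definitions, an unchanged layer scores zero on both metrics, and uniformly halving a layer's weights gives an SVDR of 0.5.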
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main weights and trains small adapter matrices
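A minimal numpy sketch of the LoRA idea (hypothetical sizes `d` and rank `r`; real implementations train `A` and `B` with a framework such as PEFT): the frozen weight `W` is untouched, and only the small low-rank pair contributes a trainable update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hypothetical hidden size and LoRA rank, r << d

W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-init

def lora_forward(x, alpha=16.0):
    """Frozen path plus the scaled low-rank update (B @ A).

    With B initialized to zero, the update starts as a no-op, so training
    begins exactly at the pretrained model's behavior.
    """
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

The trainable parameter count is 2*r*d instead of d*d, which is where the "parameter-efficient" label comes from.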
EWC: Elastic Weight Consolidation—a regularization technique that penalizes changes to parameters important for previous tasks
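The EWC penalty itself is a short quadratic term; a standard sketch (with the diagonal Fisher information standing in for parameter importance) looks like:

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    Parameters with high Fisher information (important for the old task)
    are penalized more strongly for drifting from their previous values.
    """
    theta = np.asarray(theta, dtype=float)
    theta_old = np.asarray(theta_old, dtype=float)
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_old) ** 2))
```

In training, this term is added to the new task's loss, so the penalty is zero as long as the parameters stay at their old values and grows quadratically as important parameters move.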
Spectral Analysis: Analyzing the eigenvalues or singular values of weight matrices to understand their signal strength and redundancy
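As a small sketch of spectral analysis in this sense, one simple redundancy indicator (my choice of summary, not necessarily the paper's) is the number of leading singular values needed to capture most of a matrix's singular-value mass:

```python
import numpy as np

def effective_rank(W, tol=0.99):
    """Number of leading singular values capturing `tol` of the total
    singular-value mass. A value far below min(W.shape) suggests the
    layer's information is concentrated in few directions (redundancy)."""
    s = np.linalg.svd(W, compute_uv=False)
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, tol) + 1)
```

An identity matrix spreads its mass evenly across all singular values, while a rank-1 matrix concentrates it in one, so the two extremes bracket what this indicator reports for real weight matrices.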