Base model: A large language model pre-trained on massive text corpora to predict the next token, serving as the foundation for further tuning.
Instruct model: A version of the Base model that has undergone post-training (e.g., SFT and RLHF) to follow user instructions and align with human preferences.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes original weights and trains small low-rank matrices to approximate weight updates.
DPO: Direct Preference Optimization—an alignment method that optimizes models to prefer specific responses over others without a separate reward model.
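A sketch of the DPO objective for a single preference pair, assuming we already have summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model (the numeric values below are illustrative only):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO loss: -log sigmoid(beta * implicit-reward margin), where each
    # implicit reward is the policy/reference log-probability ratio.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy equals the reference, the margin is 0 and the loss is log 2.
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2.0)) < 1e-9
# Favoring the chosen response more than the reference does lowers the loss.
assert dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2.0)
```

No separate reward model is needed because the reward is expressed implicitly through the policy-to-reference log-probability ratio.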
BAAI-2k: A subset of 2000 high-quality instruction samples extracted from the BAAI-Infinity-Instruct Dataset used for tuning experiments in this paper.
MLLM: Multimodal Large Language Model—an LLM capable of processing and generating content across multiple modalities like text and images.
Task Vector: A vector representing the difference in weights between a fine-tuned model and its pre-trained base, encoding specific task capabilities.
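A minimal sketch of computing and applying a task vector, with toy parameter dictionaries standing in for real model checkpoints:

```python
import numpy as np

def task_vector(theta_ft, theta_pre):
    # tau = theta_ft - theta_pre, computed per parameter tensor.
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def apply_task_vector(theta_pre, tau, alpha=1.0):
    # Adding the (optionally scaled) task vector back onto the
    # pre-trained weights recovers or interpolates the fine-tuned model.
    return {k: theta_pre[k] + alpha * tau[k] for k in theta_pre}

pre = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
ft  = {"w": np.array([1.5, 1.0]), "b": np.array([0.0])}
tau = task_vector(ft, pre)
rebuilt = apply_task_vector(pre, tau)
assert all(np.allclose(rebuilt[k], ft[k]) for k in ft)
```

Because task vectors live in weight space, they can be scaled, added, or negated to compose or remove task capabilities.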
Gradient descent: An optimization algorithm that minimizes the loss function by iteratively updating parameters in the direction of steepest descent, i.e., opposite to the gradient.
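The update rule θ ← θ − η∇L(θ) can be sketched on a toy one-dimensional loss (the function and learning rate here are illustrative):

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    # Repeatedly step against the gradient of the loss.
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
assert abs(theta_star - 3.0) < 1e-6  # converges to the minimizer theta = 3
```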
Pass@k: A metric measuring the probability that at least one of k sampled code solutions is correct (i.e., passes the unit tests).
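Pass@k is usually computed with the unbiased estimator of Chen et al. (2021): generate n samples per problem, count c correct ones, and estimate 1 − C(n−c, k)/C(n, k). A sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: probability that a random size-k subset
    # of the n samples contains at least one of the c correct solutions.
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(n=10, c=10, k=1) == 1.0       # all samples correct
assert pass_at_k(n=10, c=0, k=5) == 0.0        # no sample correct
assert abs(pass_at_k(n=2, c=1, k=1) - 0.5) < 1e-12
```

Averaging this estimate over all problems in the benchmark gives the reported pass@k score.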