Evaluation Setup
The paper fine-tunes Llama-2-7b-chat on specific downstream tasks and evaluates both task performance and retention of general and safety capabilities.
Benchmarks:
- GSM8K (Mathematical Reasoning)
- HumanEval (Code Generation)
- Advbench (Safety/Jailbreak Evaluation)
- AlpacaEval (General Helpfulness)
Metrics:
- Accuracy (Math)
- Pass@1 (Code)
- Raw Safe Rate (Safety)
- Jailbreak Safe Rate (Safety)
- Win Rate (Helpfulness)
- Statistical methodology: Not explicitly reported in the paper
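For context on the code metric above, HumanEval's pass@1 is a special case of the standard unbiased pass@k estimator (the exact sampling settings used in the paper are not stated here); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples per problem,
    of which c pass the unit tests, estimate P(at least one of k samples passes)."""
    if n - c < k:
        return 1.0  # too few failures for any k-subset to be all-failing
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the raw pass fraction c/n.
```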
Key Results
Fine-tuning on the OpenFunctions tool-use dataset degrades general coding skills (HumanEval), but SDFT prevents this drop.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| HumanEval | pass@1 | 9.76 | 15.24 | +5.48 |

Safety evaluation shows massive degradation with vanilla fine-tuning on GSM8K, which SDFT largely recovers.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| Advbench | Jailbreak Safe Rate | 54.81 | 80.77 | +25.96 |
| AlpacaEval | Win Rate | 23.38 | 66.73 | +43.35 |

Multi-task fine-tuning on OpenHermes also benefits from SDFT in terms of safety.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| Advbench | Jailbreak Safe Rate | 61.54 | 87.50 | +25.96 |
Main Takeaways
- Vanilla fine-tuning consistently degrades safety alignment and general helpfulness across both single-task and multi-task datasets.
- SDFT effectively bridges the distribution gap between the fine-tuning data and the model's original output distribution, allowing the model to learn downstream tasks without catastrophic forgetting of safety guardrails.
- The method is robust across different domains (math, code, tool use) and dataset sizes (2k to 20k examples).