How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition

📝 Paper Summary

Supervised Fine-Tuning (SFT), Multi-task Learning, Data Composition

This paper analyzes how different SFT data compositions affect LLM abilities and proposes Dual-stage Mixed Fine-tuning (DMT) to balance specialized skills with general human alignment.

Core Problem

Fine-tuning LLMs on composite data (math, code, general instructions) often leads to performance conflicts or catastrophic forgetting, where improving one ability degrades others.

Why it matters:

Proprietary models like GPT-4 exhibit versatility, but open-source models struggle to maintain specialized skills (reasoning, coding) when fine-tuned for general alignment
Directly mixing large datasets creates conflicts in high-resource settings, while sequential training causes models to forget previous tasks
Understanding scaling laws for SFT data composition is crucial for building versatile open-source LLMs efficiently

Concrete Example: When LLaMA-33B is trained sequentially (Code/Math first, then General), its GSM8K accuracy drops from 57.91% with math-only fine-tuning to 44.24% due to catastrophic forgetting. Conversely, mixing all data at once lowers the MT-Bench score to 6.07, below the 6.63 obtained with general-only training.

Key Novelty

Dual-stage Mixed Fine-tuning (DMT)

Splits training into two stages: first maximizing specialized skills (math/code) on their full datasets, then training on general data mixed with a tiny fraction (e.g., 1%) of the specialized data
Leverages the finding that general abilities saturate quickly (~1k samples) while specialized skills need more data, and that a small 'replay' buffer prevents forgetting during the alignment phase
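The two-stage recipe described above can be sketched as a data-preparation step. This is a minimal illustration, not the authors' code: the function name `build_dmt_stages`, the parameter `k`, and the choice to keep at least one replay example are assumptions made here for clarity.

```python
import random

def build_dmt_stages(math_data, code_data, general_data, k=0.01, seed=0):
    """Return (stage1, stage2) example lists for a DMT-style schedule.

    stage1: every specialized (math + code) example.
    stage2: every general example plus a fraction k of each specialized set.
    The names and the rounding rule are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Stage 1: train on the full specialized datasets.
    stage1 = list(math_data) + list(code_data)
    rng.shuffle(stage1)

    def replay(data):
        # Sample a fraction k of a specialized set; keep at least one
        # example when the set is non-empty so the buffer is never empty.
        data = list(data)
        n = max(1, round(k * len(data))) if data else 0
        return rng.sample(data, n)

    # Stage 2: general data mixed with the small specialized replay buffer.
    stage2 = list(general_data) + replay(math_data) + replay(code_data)
    rng.shuffle(stage2)
    return stage1, stage2
```

With `k=0.01`, a specialized set of 1,000 examples contributes 10 examples to the second stage, so the replay buffer stays small relative to the general data.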

Architecture

Illustration of four different SFT training strategies: Multi-task Learning, Sequential Training, Mixed Sequential Training, and the proposed Dual-stage Mixed Fine-tuning (DMT).

Evaluation Highlights

LLaMA-33B with DMT achieves 56.36% on GSM8K, recovering nearly all math ability compared to Sequential Training (44.24%) and approaching the Math-only baseline (57.91%)
LLaMA-33B with DMT scores 6.73 on MT-Bench, outperforming the Multi-task learning baseline (6.07) and slightly exceeding the General-only baseline (6.63)
General human-aligning abilities emerge and plateau with as few as ~1,000 samples (1/64 of ShareGPT), whereas math and code abilities scale log-linearly with data amount

Breakthrough Assessment

7/10

Provides valuable empirical scaling laws for SFT data composition and a practical, simple strategy (DMT) that effectively balances conflicting abilities. High utility for practitioners training general-purpose LLMs.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) of a pre-trained Large Language Model on a composite dataset D containing subsets for math, code, and general capabilities

Inputs: Natural language instructions (queries) corresponding to specific domains (Math, Code, General)

Outputs: Generated textual responses (solutions, code snippets, or chat responses)

Pipeline Flow

Input Instruction
LLaMA Transformer Layers
Autoregressive Generation
Output Response

System Modules

LLaMA Base Model

Pre-trained backbone providing fundamental language understanding and generation capabilities

Model or implementation: LLaMA (7B, 13B, 33B variants)

Modeling

Base Model: LLaMA (7B, 13B, 33B)

Training Method: Dual-stage Mixed Fine-tuning (DMT)

Objective Functions:

Purpose: Minimize the difference between generated tokens and ground truth responses.

Formally: Standard Cross-Entropy Loss over the target tokens.
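One common way to write this objective is shown below. The notation is chosen here for illustration and is not taken from the paper: q is a query, r its reference response, D the composite dataset from the problem definition, and θ the model parameters. The loss is summed only over response tokens.

```latex
\mathcal{L}(\theta) = - \sum_{(q, r) \in D} \; \sum_{t=1}^{|r|} \log p_\theta\left(r_t \mid q, r_{<t}\right)
```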

Adaptation: Full fine-tuning (implied by FastChat usage and lack of LoRA mention)

Training Data:

Math: GSM8K RFT (Rejection sampling Fine-Tuning)
Code: Code Alpaca
General: ShareGPT (approx. 100k samples)
Composite datasets created by sampling proportions {1, 1/4, 1/16, 1/64, 1/256}
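The proportional subsampling can be sketched as follows. This is a hedged illustration: the helper names `subsample` and `proportional_subsets` are hypothetical, and drawing each subset independently at random is an assumption, since the summary does not say how the subsets were selected.

```python
import random

# Proportions listed in the summary for the scaling experiments.
PROPORTIONS = [1, 1/4, 1/16, 1/64, 1/256]

def subsample(dataset, proportion, seed=0):
    """Draw a uniform random subset holding `proportion` of the dataset."""
    data = list(dataset)
    rng = random.Random(seed)
    # Keep at least one example for non-empty data, never more than all of it.
    n = min(len(data), max(1, int(len(data) * proportion)))
    return rng.sample(data, n)

def proportional_subsets(dataset, seed=0):
    """Map each proportion to an independently drawn subset of the dataset."""
    return {p: subsample(dataset, p, seed) for p in PROPORTIONS}
```

A composite dataset at a given proportion would then be the concatenation of the math, code, and general subsets drawn at that proportion.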

Key Hyperparameters:

learning_rate: 2e-5 (peak)
batch_size: 16
epochs: 3

Compute: Hardware and training time are not reported; training FLOPs are listed in Appendix D of the paper

Comparison to Prior Work

vs. Multi-task Learning: DMT avoids high-resource conflicts by separating specialized training and only mixing a small replay buffer in the final stage
vs. Sequential Training: DMT explicitly addresses catastrophic forgetting by reintroducing specialized data in the final stage
vs. LIMA: Confirms LIMA's finding that general alignment needs little data, but extends this to show specialized skills (math/code) still require large-scale data, necessitating a hybrid strategy like DMT

Limitations

Experiments limited to LLaMA models up to 33B parameters due to compute constraints
Evaluation relies on MT-Bench (GPT-4 based), which may have biases
Focuses only on three capabilities (Math, Code, General), excluding others like translation or creative writing

Reproducibility

The paper uses open-source datasets (GSM8K, Code Alpaca, ShareGPT) and the FastChat framework. Specific training scripts are not linked, but hyperparameters are provided. Training FLOPs are listed in Appendix D.

📊 Experiments & Results

Evaluation Setup

Supervised Fine-Tuning on single or composite datasets followed by evaluation on specific capability benchmarks

Benchmarks:

GSM8K (Mathematical Reasoning)
HumanEval (Code Generation)
MT-Bench (General Human Alignment / Chat)

Metrics:

Accuracy (GSM8K)
Pass@1 (HumanEval)
MT-Bench Score (1-10 scale)

Statistical methodology: Not explicitly reported in the paper
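For reference, Pass@1 is a special case of the pass@k metric introduced with HumanEval. The sketch below shows the standard unbiased estimator. The summary does not state how the paper generated samples (for example greedy decoding versus sampling), so treat this as background on the metric rather than the paper's exact evaluation procedure.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples passing all unit tests,
    k: sample budget being scored. Equals 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For `k=1` the estimator reduces to `c / n`, the fraction of samples that pass. The benchmark score is the mean over all problems.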

Key Results

Comparison of different training strategies on LLaMA-33B shows DMT achieves the best balance across all three metrics.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	44.24	56.36	+12.12
MT-Bench	Score	6.07	6.73	+0.66
HumanEval	Pass@1	18.90	25.00	+6.10

Scaling experiments reveal different data requirements for different abilities.

Benchmark	Metric	Baseline	This Paper	Δ
MT-Bench	Score	4.5	6.5	+2.0

Experiment Figures

Scaling curves for Math, Code, and General abilities across LLaMA-7B, 13B, and 33B as data amount increases.

Comparison of 'Individual' vs 'Mixed' data training across model sizes.

Main Takeaways

Distinct scaling laws: Math and Code abilities improve log-linearly with more data, while General Human Alignment plateaus quickly (after ~1k samples).
Data mixing effects: Mixing data sources helps performance in low-resource settings (beneficial noise/regularization) but causes conflicts in high-resource settings.
Catastrophic forgetting: Sequential training preserves the most recent task but severely degrades prior specialized skills; DMT mitigates this by mixing a small ratio of prior data.
Model size impact: Larger models (33B vs 7B) benefit more from mixed data in low-resource settings and are generally more robust to data composition strategies.
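The log-linear trend reported for math and code can be checked with a simple least-squares fit of score against the logarithm of the data amount. The sketch below is illustrative only: the function name `fit_log_linear` is hypothetical and the example values are synthetic, not numbers from the paper.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log2(size); returns (a, b)."""
    xs = [math.log2(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    # Closed-form simple linear regression on (log2(size), score).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic example: scores that grow by 2.5 points per doubling of data.
sizes = [2 ** i for i in range(5, 13)]
scores = [3.0 + 2.5 * math.log2(n) for n in sizes]
intercept, slope = fit_log_linear(sizes, scores)
```

A slope `b` that stays clearly positive across the whole range is consistent with log-linear scaling. A curve that flattens after a small number of samples, as the summary reports for general alignment, would instead show residuals that grow at larger data amounts.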

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Supervised Fine-Tuning (SFT)
Familiarity with catastrophic forgetting in multi-task learning
Basic knowledge of scaling laws (relationship between data/compute and performance)

Key Terms

SFT: Supervised Fine-Tuning—the process of training a pre-trained base model on labeled instruction-response pairs to activate specific capabilities

DMT: Dual-stage Mixed Fine-tuning—the proposed strategy of training on specialized data first, then on general data mixed with a small amount of specialized data

Catastrophic Forgetting: A phenomenon where a model abruptly forgets previously learned information upon learning new information

GSM8K: Grade School Math 8K, a benchmark dataset of 8.5k high-quality, linguistically diverse grade school math word problems

HumanEval: A benchmark for evaluating code generation capabilities, consisting of 164 hand-written Python programming problems whose solutions are checked with unit tests

MT-Bench: A benchmark for evaluating the conversational and instruction-following abilities of LLMs using multi-turn questions

ShareGPT: A dataset of user-shared conversations with ChatGPT, used for training general human alignment