SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

📝 Paper Summary

Data Selection for Fine-tuning; Efficient Training of LLMs

S2L selects high-quality fine-tuning data by clustering the training loss trajectories of a small proxy model, exploiting the finding that these trajectories predict gradient dynamics in much larger target models.

Core Problem

Selecting fine-tuning data for specialized domains (like math or medicine) is hard: cheap heuristic metrics fail to identify useful examples, and scoring individual examples with large models is prohibitively expensive.

Why it matters:

Specialized domains (math, medicine, code) require high-quality SFT data, but generalist metrics like perplexity often fail to identify the most effective training examples due to distribution shifts
Scoring every training example with a target LLM (e.g., 7B+ parameters) to determine its value is too slow and resource-intensive for large datasets

Concrete Example: Metrics like perplexity derived from a pre-trained model might rank data points poorly for fine-tuning because they don't capture learning dynamics. S2L avoids this by observing actual training progress (loss over time) on a cheaper proxy model.

Key Novelty

Trajectory-Based Scalable Data Selection

Uses the sequence of loss values (trajectory) over the course of training a small 'proxy' model as a feature vector for each data point, rather than a single static score
Clusters these trajectories to group examples with similar learning dynamics, then samples uniformly from clusters to ensure gradient diversity during training of the large target model
Leverages the observation that training dynamics transfer across model scales, allowing a 70M parameter model to guide data selection for a 7B parameter model
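
The cluster-then-sample step described above can be sketched as follows. This is a minimal illustration, assuming that "sampling uniformly from clusters" means giving every cluster an equal share of the budget and letting small clusters pass their unused share on to larger ones; the function name `balanced_sample` and its parameters are hypothetical and not taken from the paper's code.

```python
import random
from collections import defaultdict

def balanced_sample(cluster_ids, budget, seed=0):
    """Pick up to `budget` example indices, spread as evenly as possible over clusters."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    selected = []
    # Visit small clusters first so their unused share rolls over to larger ones.
    ordered = sorted(clusters.values(), key=len)
    for pos, members in enumerate(ordered):
        remaining_clusters = len(ordered) - pos
        share = (budget - len(selected)) // remaining_clusters
        if len(members) <= share:
            selected.extend(members)                      # take the whole small cluster
        else:
            selected.extend(rng.sample(members, share))   # subsample a large cluster
    return sorted(selected)

# Toy cluster assignment: one rare cluster (2 examples) and two common ones.
ids = [0] * 2 + [1] * 10 + [2] * 20
picked = balanced_sample(ids, budget=12)
print(len(picked))
```

With this allocation the rare cluster is kept in full, which is one way to preserve examples with unusual learning dynamics that uniform random sampling over the whole dataset would mostly miss.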

Architecture

The conceptual workflow of the S2L data selection process.

Evaluation Highlights

Matches the performance of the full MathInstruct dataset using only 11% of the original data (approx. 30K examples vs 260K)
Achieves 32.7% accuracy on the MATH benchmark with just 50K training examples, improving on the Phi-2 baseline (16.1%) by 16.6 percentage points
Outperforms state-of-the-art data selection algorithms (including GraNd and EL2N) by an average of 4.7% across 6 in-domain and out-of-domain evaluation datasets
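
As a quick sanity check, the headline numbers above can be reproduced with simple arithmetic. The snippet only uses figures quoted in this summary (260K examples, 11%, 32.7% accuracy, a 16.6-point gain); the variable names are illustrative.

```python
full_size = 260_000      # approximate MathInstruct size quoted above
fraction = 0.11          # share of the data S2L needs to match full-data performance
subset_size = round(full_size * fraction)

s2l_math_acc = 32.7      # MATH accuracy with 50K S2L-selected examples (%)
gain_points = 16.6       # reported improvement in percentage points
baseline_acc = round(s2l_math_acc - gain_points, 1)

print(subset_size)       # 28600, consistent with "approx. 30K"
print(baseline_acc)      # 16.1
```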

Breakthrough Assessment

8/10

Offers a highly practical solution to the 'compute vs. data quality' trade-off. Theoretical grounding of loss trajectory transfer across model scales is significant for efficient LLM training.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-tuning (SFT) of Large Language Models on specialized domain data under a fixed data budget

Inputs: Training dataset D_train with pairs of prompts x and responses y; Data budget B

Outputs: A selected subset S ⊆ D_train with |S| ≤ B such that training on S maximizes performance on the specialized domain

Pipeline Flow

Input: Full Training Dataset
Module 1: Small Proxy Training & Trajectory Collection
Module 2: Trajectory Clustering
Module 3: Balanced Sampling
Module 4: Target Model Training (Standard SFT)

System Modules

Proxy Model Trainer (Data Selection)

Train a small model to generate loss signatures for all data points

Model or implementation: Small Reference Model (e.g., Pythia-70M)
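
To make the idea of a "loss signature" concrete, here is a minimal sketch of trajectory collection. It assumes a toy logistic-regression model stands in for the proxy (the paper uses a small language model such as Pythia-70M) and that one checkpoint is taken per epoch; the function `collect_loss_trajectories` and the toy data are hypothetical.

```python
import math
import random

def collect_loss_trajectories(features, labels, epochs=5, lr=0.5, seed=0):
    """Train a tiny stand-in 'proxy' and record every example's loss after each epoch."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [0.0] * dim
    trajectories = [[] for _ in features]

    def predict(x):
        z = sum(w * v for w, v in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def example_loss(x, y):
        p = min(max(predict(x), 1e-12), 1 - 1e-12)  # clamp for numerical safety
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))

    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # one SGD step per example
            error = predict(features[i]) - labels[i]
            for d in range(dim):
                weights[d] -= lr * error * features[i][d]
        # Checkpoint: score every example with the current weights.
        for i, (x, y) in enumerate(zip(features, labels)):
            trajectories[i].append(example_loss(x, y))
    return trajectories

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
traj = collect_loss_trajectories(X, y, epochs=4)
print(len(traj), len(traj[0]))  # one trajectory per example, one value per checkpoint
```

Each row of `traj` is the feature vector for one training example; in the real setting the per-example loss would be the language-modeling loss on the response tokens.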

Trajectory Clusterer (Data Selection)

Group examples based on similarity of their training dynamics

Model or implementation: Clustering Algorithm (e.g., K-Means)
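
Grouping examples by learning dynamics can be illustrated with a plain K-Means over loss trajectories. This is a dependency-free sketch using squared Euclidean distance; the toy trajectories, the choice of k = 2, and the function name `kmeans` are assumptions for illustration, and a library implementation would normally be used instead.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd's algorithm: returns (cluster assignment per point, cluster centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each trajectory to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Toy trajectories: two examples learned quickly, two barely learned at all.
trajectories = [
    [2.0, 0.5, 0.1],
    [2.1, 0.4, 0.1],
    [2.0, 1.9, 1.8],
    [2.1, 2.0, 1.9],
]
assign, centers = kmeans(trajectories, k=2)
print(assign)
```

The fast-learned and slow-learned examples end up in different clusters even though their initial losses are nearly identical, which is the kind of distinction a single-snapshot score cannot make.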

Target Model Trainer

Fine-tune the large target model on the curated subset

Model or implementation: Target LLM (e.g., Phi-2, Llama-2-7B)

Modeling

Base Model: Phi-2 (2.7B) and Llama-2-7B (Target Models); Pythia-70M (Proxy Model)

Training Method: Supervised Fine-Tuning (SFT) on selected subset

Objective Functions:

Purpose: Minimize the negative log likelihood of the correct response given the prompt.

Formally: L(theta) = - sum_{t=1}^{|y|} log p_theta(y_t | x, y_{<t}), where the sum runs over the tokens of the response y
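
A small worked example of this objective, assuming the model's probability for each correct response token is already known; the probabilities and the helper name `response_nll` are made up for illustration.

```python
import math

def response_nll(token_probs):
    """Negative log-likelihood of a response, given p_theta(y_t | x, y_<t) per token."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigns to the three correct response tokens.
probs = [0.9, 0.5, 0.8]
loss = response_nll(probs)
print(round(loss, 4))  # 1.0217, i.e. -log(0.9 * 0.5 * 0.8)
```

Recording this value for the same example at successive checkpoints yields that example's loss trajectory.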

Adaptation: Full fine-tuning (inferred; the summary does not mention LoRA or other parameter-efficient methods)

Trainable Parameters: Full model parameters (unless otherwise specified in full paper)

Training Data:

MathInstruct (Math problem solving)
MIMIC-III (Clinical text summarization)

Key Hyperparameters:

proxy_model_size: 70M parameters
target_model_size: Up to 7B parameters

Compute: The proxy model (70M parameters) is up to 100x smaller than the target (7B parameters), which reduces selection cost roughly in proportion compared to target-model-based scoring
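
The size ratio can be turned into a rough cost estimate. The sketch below assumes the common rule of thumb that training cost is about 6 x parameters x tokens in FLOPs; this approximation and the token count are not from the summary, and the paper's own cost accounting may differ.

```python
def training_flops(num_params, num_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token (an assumption)."""
    return 6 * num_params * num_tokens

proxy_params = 70e6    # Pythia-70M
phi2_params = 2.7e9    # Phi-2
llama_params = 7e9     # Llama-2-7B
tokens = 1e8           # hypothetical number of training tokens, identical for all models

ratio_llama = training_flops(llama_params, tokens) / training_flops(proxy_params, tokens)
ratio_phi2 = training_flops(phi2_params, tokens) / training_flops(proxy_params, tokens)
print(round(ratio_llama, 1))  # 100.0
print(round(ratio_phi2, 1))   # 38.6
```

Under this approximation the token count cancels out, so the saving from running the selection on the proxy depends only on the parameter ratio: about 100x relative to Llama-2-7B and about 39x relative to Phi-2.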

Comparison to Prior Work

vs. EL2N/GraNd: S2L uses the full training trajectory (temporal dynamics) from a proxy model rather than a single-snapshot metric
vs. Moderate DS: S2L adapts to the specific fine-tuning dynamics via proxy training rather than relying on zero-shot perplexity from a pre-trained model
vs. SCIP: S2L is domain-agnostic (uses loss trajectories) rather than relying on domain-specific features like code syntax [SCIP is cited]
vs. LESS [not cited in paper]: LESS also selects data via influence estimation, but S2L explicitly focuses on clustering loss trajectories to ensure gradient diversity rather than direct influence optimization

Limitations

Relies on the assumption that loss trajectories on a small model correlate with gradient dynamics on a large model (though empirically validated)
Requires training a proxy model, which, while small, is an additional step compared to zero-shot metrics like perplexity
Performance depends on the quality of the clustering algorithm and the choice of the number of clusters

Reproducibility

Code: https://github.com/BigML-CS-UCLA/S2L

The paper uses public datasets (MathInstruct, MIMIC-III) and open-source models (Phi-2, Llama-2). Exact hyperparameters for the clustering step (e.g., number of clusters K) are not detailed in the summary text.

📊 Experiments & Results

Evaluation Setup

Supervised Fine-Tuning on domain-specific datasets followed by evaluation on in-domain and out-of-domain test sets

Benchmarks:

MathInstruct (Mathematical Problem Solving)
MATH (Challenging Mathematics Problems)
MIMIC-III (Clinical Text Summarization)

Metrics:

Accuracy (for Math)
ROUGE scores (implied for Summarization)
Statistical methodology: Not explicitly reported in the paper

Key Results

S2L significantly reduces data requirements while matching or exceeding full dataset performance on mathematical reasoning tasks.
Benchmark	Metric	Baseline	This Paper	Δ
MathInstruct	Relative data size required to match full-data performance (%)	100	11	-89
MATH	Accuracy (%)	16.1	32.7	+16.6
Average across 6 datasets	Accuracy (%)	Not reported in the paper	Not reported in the paper	+4.7
MIMIC-III	Performance relative to full data (%)	100	100	0

Experiment Figures

Visualization of loss trajectory clusters, showing how different groups of examples exhibit distinct learning curves on the proxy model.

Main Takeaways

Training dynamics (loss trajectories) generalize across model scales, allowing a 70M parameter model to effectively select data for a 7B parameter model
Clustering loss trajectories is more effective than selecting based on static metrics like perplexity or single-snapshot error (EL2N), as it captures the learning difficulty over time
Data diversity (ensured by sampling from different trajectory clusters) is crucial for efficient fine-tuning in specialized domains
S2L is highly scalable, reducing computational costs for data selection by orders of magnitude compared to methods that require target model inference

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-tuning (SFT) of LLMs
Gradient Descent and Loss Landscapes
Clustering algorithms (e.g., K-Means)

Key Terms

SFT: Supervised Fine-tuning—adapting a pre-trained language model to a specific task using labeled examples

Loss Trajectory: The sequence of loss values recorded for a specific training example at multiple checkpoints throughout the training process

Proxy Model: A significantly smaller model (e.g., 70M params) used to compute efficient data selection metrics for a larger target model (e.g., 7B params)

Hessian: A matrix of second-order partial derivatives of the loss function, representing the curvature of the loss landscape

Incremental Gradient (IG): Optimization methods like Stochastic Gradient Descent that update parameters iteratively based on gradients of individual examples or mini-batches

MathInstruct: A large-scale dataset of mathematical problems and solutions used for instruction tuning LLMs

MIMIC-III: A widely used dataset containing de-identified health data associated with intensive care unit admissions, used here for clinical text summarization