Evaluation Setup
Pre-fine-tuning on selected data, followed by targeted fine-tuning (or zero-shot evaluation)
Benchmarks:
- GPT-2 Toxicity Reduction (Safety/NLG)
- Domain Adaptation Tasks (NLU; 8 tasks from Gururangan et al. 2020)
- Zero-shot Evaluation (General Capabilities)
Metrics:
- Toxicity Level
- Task Performance (Accuracy/F1)
- Zero-shot Performance
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GPT-2 Toxicity | Toxicity Reduction | 0 | 30 | 30 |
| 8 Domain-specific tasks | Average Performance | Not reported in the paper | Not reported in the paper | 1.13 |
| Zero-shot tasks (models up to 2.7B) | Task Performance | Not reported in the paper | Not reported in the paper | 13.9 |
Main Takeaways
- GOT-D consistently outperforms existing selection methods, particularly in low-budget regimes (e.g., 10k-50k samples).
- The method is computationally efficient, scaling to millions of candidate samples in minutes using GPU acceleration.
- Visualizations in the paper show that GOT-D selects samples that are underrepresented in pre-training but important for the target domain, confirming the theoretical intuition.
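The selection principle behind these takeaways can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes candidate and target examples are already embedded as vectors, and uses a small log-domain Sinkhorn solver for entropic optimal transport. The source-side dual (Kantorovich) potential acts as the gradient of the OT distance with respect to a sample's weight, so selecting the samples with the lowest potential moves the data mixture toward the target distribution. All names and the toy data below are illustrative.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def ot_dual_potential(C, reg=0.05, n_iter=500):
    """Log-domain Sinkhorn iterations for entropic OT with uniform marginals.

    C: (n, m) cost matrix between n candidate and m target points.
    Returns the source-side dual potential f (one value per candidate),
    which is, up to a constant, the sensitivity of the OT distance to
    up-weighting that candidate.
    """
    n, m = C.shape
    log_a = np.full(n, -np.log(n))  # uniform weights on candidates
    log_b = np.full(m, -np.log(m))  # uniform weights on targets
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iter):
        f = -reg * logsumexp((g[None, :] - C) / reg + log_b[None, :], axis=1)
        g = -reg * logsumexp((f[:, None] - C) / reg + log_a[:, None], axis=0)
    return f

def select_by_ot_gradient(cand_emb, target_emb, k):
    """Pick the k candidates whose dual potential is lowest, i.e. whose
    up-weighting most decreases the OT distance to the target set."""
    # squared Euclidean cost between every candidate/target embedding pair
    C = ((cand_emb[:, None, :] - target_emb[None, :, :]) ** 2).sum(-1)
    f = ot_dual_potential(C)
    return np.argsort(f)[:k]

# Toy demo with hypothetical 2-D "embeddings": two candidate clusters,
# a target distribution matching the second cluster.
rng = np.random.default_rng(0)
cand = np.vstack([rng.normal(0, 1, (50, 2)),   # indices 0-49: off-target
                  rng.normal(4, 1, (50, 2))])  # indices 50-99: on-target
target = rng.normal(4, 1, (20, 2))
idx = select_by_ot_gradient(cand, target, k=10)
```

In the demo, the selected indices concentrate in the on-target cluster (indices 50-99), mirroring the claim that the method surfaces samples aligned with the target distribution; the full method additionally calibrates the gradient against the pre-training distribution, which this sketch omits.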