
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Feiyang Kang, H. Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia
International Conference on Learning Representations (2024)
Pretraining Benchmark

📝 Paper Summary

Data Selection for Efficient LLM Fine-Tuning
GOT-D selects pre-fine-tuning ("warm-up") data by identifying the samples that most effectively shift the pre-training distribution toward the target distribution, using gradients of the Optimal Transport (OT) distance.
Core Problem
Existing data selection methods prioritize samples that match the target distribution, ignoring that the pre-trained model already covers common data, which leads to inefficiency and marginal gains in low-budget settings.
Why it matters:
  • Fine-tuning LLMs is costly; acquiring expert-annotated data for new tasks (e.g., safety interventions) is slow and expensive
  • Current methods fail to account for the pre-training distribution, selecting redundant data that contributes little to the model's adaptation
  • In low-selection-budget regimes (e.g., 50K samples), existing distribution-matching methods provide only marginal improvements
Concrete Example: When fine-tuning a model to reduce toxicity, standard methods might select generic safe text that the model already knows. GOT-D specifically identifies underrepresented samples that actively pull the model's internal distribution away from toxicity and toward the safe target.
Key Novelty
GOT-D (Gradients of Optimal Transport for Data Selection)
  • Instead of selecting data that simply matches the target distribution, GOT-D selects data that minimizes the distance between the 'effective' fine-tuned distribution (a mix of pre-training and new data) and the target.
  • It treats data selection as a gradient descent problem on the distribution space: identifying samples with the largest negative gradients w.r.t. the Optimal Transport distance to the target.
  • Uses the dual solution of the Optimal Transport problem to efficiently compute these gradients without retraining the model.
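The selection rule above (rank candidates by the gradient of the OT distance with respect to their mass, then keep the most negative ones) can be sketched with a small entropic-OT (Sinkhorn) approximation. This is a minimal illustration, not the authors' implementation: the paper uses the exact OT dual on real embeddings, while the `sinkhorn_dual` function, the toy 1-D "embeddings", and the regularization values here are all assumptions for demonstration.

```python
import numpy as np

def sinkhorn_dual(a, b, M, eps=0.1, n_iter=500):
    """Entropic OT via Sinkhorn iterations.

    Returns the source dual potentials f, which approximate the
    gradient of OT(a, b) w.r.t. the source weights a (up to a
    constant shift, which does not affect the ranking).
    """
    K = np.exp(-M / eps)          # Gibbs kernel
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)           # scale rows to match source marginal a
        v = b / (K.T @ u)         # scale cols to match target marginal b
    f = eps * np.log(u)           # source potentials = per-sample gradients
    return f

# Toy setup: target distribution concentrated near 0; half the
# candidate pool is close to the target, half is far away.
rng = np.random.default_rng(0)
target = rng.normal(0.0, 0.1, size=(50, 1))
candidates = np.concatenate([rng.normal(0.0, 0.1, size=(25, 1)),
                             rng.normal(3.0, 0.1, size=(25, 1))])

M = (candidates - target.T) ** 2              # squared-distance cost matrix
a = np.full(len(candidates), 1 / len(candidates))  # uniform source weights
b = np.full(len(target), 1 / len(target))          # uniform target weights

f = sinkhorn_dual(a, b, M)
selected = np.argsort(f)[:10]   # most negative gradients = best to add
```

In this toy run, the selected indices all fall in the near-target half of the pool: adding mass there decreases the OT distance to the target fastest, which is exactly the "shifting" criterion GOT-D optimizes.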
Architecture
Figure 1: Conceptual workflow of the two-stage fine-tuning approach involving data selection.
Evaluation Highlights
  • Reduces GPT-2 toxicity levels by 30% using only 10K selected samples, compared to conventional fine-tuning
  • Improves average performance by +1.13% across 8 domain-specific tasks with 150K samples, compared to baselines
  • Boosts zero-shot task performance by +13.9% with only 40K samples on models up to 2.7B parameters
Breakthrough Assessment
8/10
Offers a mathematically grounded approach (OT gradients) that reframes data selection from 'matching' the target distribution to 'shifting' the effective distribution toward it, with significant efficiency gains in low-budget regimes.