
DELIFT: Data Efficient Language model Instruction Fine Tuning

Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevsky
University of Illinois Urbana-Champaign, IBM Research
arXiv (2024)

📝 Paper Summary

Data Selection for Fine-Tuning · Instruction Tuning · Continual Learning
DELIFT selects diverse, high-value training subsets for all fine-tuning stages using a pairwise utility metric based on how much one sample improves the model's prediction of another.
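The pairwise utility idea can be sketched as follows: the utility of sample i for sample j is the drop in the model's loss on j when i is supplied as an in-context example. This is a minimal illustration, not DELIFT's exact formulation (the paper uses a distance between the model's predicted distribution and the ground truth); `toy_loss` and the word-overlap heuristic are stand-ins invented here for demonstration.

```python
def pairwise_utility(loss_fn, sample_i, sample_j):
    """Utility of sample_i for sample_j: how much loss on j drops
    when sample_i is given as a one-shot in-context example."""
    base_loss = loss_fn(sample_j, context=None)       # zero-shot loss on j
    icl_loss = loss_fn(sample_j, context=sample_i)    # one-shot loss with i in context
    return base_loss - icl_loss                       # > 0 means i is informative for j


# Toy stand-in for an LM loss: samples are (prompt, answer) pairs, and a
# context example "helps" in proportion to word overlap between answers.
def toy_loss(sample, context=None):
    _, answer = sample
    loss = float(len(set(answer.split())))
    if context is not None:
        _, ctx_answer = context
        overlap = len(set(answer.split()) & set(ctx_answer.split()))
        loss -= 0.5 * overlap                         # context reduces the loss
    return loss


a = ("Capital of France?", "Paris is the capital")
b = ("Largest French city?", "Paris is the largest city")
print(pairwise_utility(toy_loss, a, b))   # related samples -> positive utility
```

With a real model, `loss_fn` would be per-token cross-entropy on j's response, with and without (i's prompt, i's response) prepended to the prompt.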
Core Problem
Fine-tuning LLMs is resource-intensive due to redundant data, and existing selection methods are either too computationally expensive (gradient-based) or fail to adapt to the model's evolving state (static embeddings).
Why it matters:
  • Current methods struggle to scale with large models and datasets, limiting broader deployment
  • Existing approaches typically target single fine-tuning stages, failing to unify instruction tuning, task adaptation, and continual learning
  • Static selection ignores how a model's knowledge shifts, leading to suboptimal data subsets
Concrete Example: In a task-specific fine-tuning scenario, a static embedding method might select samples that are semantically similar to the target task but that the model already predicts well, wasting compute. DELIFT instead selects samples that actually improve the model's predictions on the target set, filtering out redundant, already-known examples.
Key Novelty
Information-Theoretic In-Context Utility
  • Measures the 'information gain' of a training sample by calculating how much its presence as an in-context example reduces the loss on another sample
  • Integrates these pairwise utility scores into a submodular optimization framework to balance diversity and informativeness
  • Adapts the selection objective (Facility Location variants) for specific stages: general coverage for instruction tuning, target alignment for task adaptation, and redundancy penalization for continual learning
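The facility-location step above can be sketched with a standard greedy maximizer over the pairwise utility matrix: f(S) = Σ_j max_{i∈S} U[i][j], which is monotone submodular, so greedy selection carries a (1 − 1/e) approximation guarantee. The matrix values below are made up for illustration; DELIFT's stage-specific variants (target alignment, redundancy penalties) are not shown.

```python
def facility_location_greedy(U, budget):
    """Greedily pick `budget` samples maximizing sum_j max_{i in S} U[i][j],
    where U[i][j] is the utility of candidate i for sample j."""
    n = len(U)
    selected = []
    best_cover = [0.0] * n  # current best utility covering each sample j
    for _ in range(budget):
        best_gain, best_i = -1.0, -1
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain: total improvement in coverage from adding i.
            gain = sum(max(U[i][j] - best_cover[j], 0.0) for j in range(n))
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
        best_cover = [max(best_cover[j], U[best_i][j]) for j in range(n)]
    return selected


# Samples 0 and 1 cover each other well; sample 2 is only covered by itself.
U = [
    [3.0, 2.0, 0.0],
    [2.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
]
print(facility_location_greedy(U, 2))  # -> [0, 2]: picks the otherwise-uncovered 2
```

Note how the greedy step favors diversity: after selecting sample 0, the redundant sample 1 adds little coverage, so the otherwise-uncovered sample 2 is chosen next.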
Evaluation Highlights
  • Reduces fine-tuning data requirements by up to 70% without compromising performance across multiple datasets and model scales
  • Outperforms existing methods (random, clustering, influential selection) by up to 26% in effectiveness and efficiency
  • Demonstrates consistent gains across Instruction Tuning, Task-Specific Fine-Tuning, and Continual Fine-Tuning stages
Breakthrough Assessment
8/10
Offers a strong, unified solution to the critical problem of data efficiency. The theoretical link between in-context learning utility and data selection is novel and practically effective, though the quadratic cost of computing pairwise utility scores remains a limiting factor at scale.