Submodular Optimization: A method for selecting subsets that naturally models diminishing returns, ensuring diversity and coverage (like selecting locations for facilities to cover a city)
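Diminishing returns has a precise meaning here: the marginal gain from adding an element never increases as the selected set grows. A minimal sketch with a toy coverage objective (the `covers` data is invented for illustration):

```python
def coverage(selected, covers):
    """Submodular objective f(S): number of distinct items covered by S."""
    covered = set()
    for name in selected:
        covered |= covers[name]
    return len(covered)

# Toy data: each candidate covers some set of items
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}

# Marginal gain of adding "b" to the empty set vs. to {"a"}:
gain_small = coverage(["b"], covers) - coverage([], covers)          # 2
gain_large = coverage(["a", "b"], covers) - coverage(["a"], covers)  # 1
# gain_small >= gain_large: the same element helps less once "a" is chosen
```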
PMI: Pointwise Mutual Information—a measure of how much more (or less) often two events co-occur than would be expected if they were independent
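In symbols, PMI(x, y) = log p(x, y) / (p(x)·p(y)): zero when the events are independent, positive when they co-occur more than chance. A quick sketch:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits."""
    return math.log2(p_xy / (p_x * p_y))

# Independent events carry no information about each other:
pmi(0.25, 0.5, 0.5)  # 0.0
# Events that always co-occur have positive PMI:
pmi(0.5, 0.5, 0.5)   # 1.0 (one bit)
```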
In-Context Learning (ICL): The ability of LLMs to perform tasks by seeing examples within the prompt context without weight updates
Facility Location: A specific submodular function that maximizes the sum of similarities between each data point and its most similar representative in the selected subset
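As a formula, f(S) = Σᵢ max over j in S of sim(i, j). A minimal sketch, assuming a precomputed similarity matrix (the values below are invented):

```python
def facility_location(selected, sim):
    """f(S): each point contributes its similarity to its closest representative in S."""
    return sum(max(sim[i][j] for j in selected) for i in range(len(sim)))

# Toy similarity matrix: sim[i][j] = similarity of data point i to candidate j
sim = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
facility_location([0, 2], sim)  # 1.0 + 0.3 + 1.0 = 2.3
```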
KL divergence: A measure of how one probability distribution differs from a second, reference probability distribution; it is asymmetric, so it is not a true distance metric
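For discrete distributions, D_KL(P ‖ Q) = Σᵢ pᵢ log(pᵢ / qᵢ). A small sketch:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats, for discrete distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl_divergence(p, p)  # 0.0: identical distributions
# Note the asymmetry: D_KL(P||Q) != D_KL(Q||P) in general
kl_divergence(p, q), kl_divergence(q, p)
```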
Teacher Forcing: A training method where the model is fed the ground-truth previous tokens as input for the next prediction, rather than its own generated guesses
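A toy sketch of the idea, with a hypothetical stand-in `model` that returns a next-token distribution; the loop is what matters: every step conditions on the ground-truth prefix, never on the model's own samples:

```python
import math

def uniform_model(prefix, vocab_size=4):
    # Hypothetical stand-in for a real network: uniform next-token distribution.
    return [1.0 / vocab_size] * vocab_size

def teacher_forced_loss(model, tokens):
    """Average cross-entropy, conditioning each step on the true prefix."""
    loss = 0.0
    for t in range(len(tokens) - 1):
        probs = model(tokens[: t + 1])   # ground-truth prefix, not generated tokens
        loss += -math.log(probs[tokens[t + 1]])
    return loss / (len(tokens) - 1)

teacher_forced_loss(uniform_model, [0, 1, 2])  # log(4), the uniform baseline
```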
Greedy Heuristic: An algorithm that makes the locally optimal choice at each stage (picking the single best next item) to approximate a global optimum
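For submodular objectives, this greedy loop carries the classic (1 − 1/e) approximation guarantee. A minimal sketch against a toy coverage objective (data invented for illustration):

```python
def greedy_select(candidates, k, f):
    """Pick k items, each round adding the one with the largest marginal gain."""
    selected = []
    for _ in range(k):
        best = max((c for c in candidates if c not in selected),
                   key=lambda c: f(selected + [c]) - f(selected))
        selected.append(best)
    return selected

covers = {"a": {1, 2, 3}, "b": {3, 4, 5, 6}, "c": {1, 2}}

def coverage(s):
    return len(set().union(*(covers[x] for x in s))) if s else 0

greedy_select(list(covers), 2, coverage)  # ["b", "a"]: all six items covered
```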