
Boosting LLM via Learning from Data Iteratively and Selectively

Q Jia, S Ren, Z Qin, F Xue, J Ni, Y You
Shanghai Jiao Tong University, The Hong Kong Polytechnic University, National University of Singapore, University of Science and Technology of China, Harbin Institute of Technology (Shenzhen)
arXiv, 12/2024
Reasoning Benchmark

📝 Paper Summary

Data Selection for Instruction Tuning · Data Efficiency in LLM Training
IterIT improves instruction tuning by iteratively re-evaluating data complexity during training and greedily selecting diverse samples based on response content rather than just instructions.
Core Problem
Existing data selection methods compute static complexity scores before training, failing to adapt to the model's evolving capabilities, and often measure diversity based on instructions rather than informative responses.
Why it matters:
  • Models' perception of difficulty shifts during training: 55% of samples rated 'hard' after one epoch were not rated hard before training began
  • Different instructions can yield similar, uninformative responses, reducing training efficiency
  • Selecting a small, high-quality subset can match or exceed full-dataset performance while significantly reducing training costs
Concrete Example: A model might find a physics problem difficult at epoch 0 but easy at epoch 1. Static selection methods would keep training on it, wasting compute, whereas IterIT detects the reduced difficulty and swaps it for a currently harder sample.
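The static complexity score in question is the Instruction-Following Difficulty (IFD) score the method re-computes each epoch. IFD is commonly defined as the ratio of the model's average loss on the response conditioned on the instruction to its loss on the response alone; a higher ratio means the instruction helps less, so the sample is harder for the current model. A minimal sketch, assuming per-token log-probabilities of the response tokens are already available (the `ifd_score` helper and the toy numbers are illustrative, not the paper's code):

```python
def ifd_score(cond_logprobs, uncond_logprobs):
    """IFD = loss(response | instruction) / loss(response).

    Both inputs are per-token log-probabilities of the *response* tokens,
    with and without the instruction in the context. Higher IFD means
    the instruction contributes less, i.e. the sample is 'harder'.
    """
    cond_loss = -sum(cond_logprobs) / len(cond_logprobs)
    uncond_loss = -sum(uncond_logprobs) / len(uncond_logprobs)
    return cond_loss / uncond_loss

# Toy numbers: for the easy sample, the instruction makes the response
# far more predictable; for the hard one, it barely helps.
easy = ifd_score([-0.2, -0.1, -0.3], [-2.0, -1.5, -1.8])
hard = ifd_score([-1.9, -1.6, -1.7], [-2.0, -1.5, -1.8])
```

Because the losses depend on the current model weights, these scores drift as training proceeds, which is exactly why a one-shot, pre-training ranking goes stale.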
Key Novelty
Iterative Complexity-Diversity Selection (IterIT)
  • Re-calculates the Instruction-Following Difficulty (IFD) score for a candidate subset after every epoch to capture the model's dynamic learning progress
  • Measures diversity using TF-IDF on *responses* (not instructions) to ensure the selected subset covers diverse, informative answers
  • Uses a coarse-to-fine strategy: filters candidates globally first, then iteratively re-scores a smaller pool to keep computational cost affordable
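The complexity-diversity trade-off in the greedy step can be sketched as follows. Each candidate's gain is its IFD score minus a penalty for resembling already-selected *responses*; `select_subset`, the `lam` weight, and the tiny whitespace-token TF-IDF helper are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """TF-IDF over lowercased whitespace tokens (deliberately minimal)."""
    tokenized = [t.lower().split() for t in texts]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(texts)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [
        {w: (c / len(toks)) * idf[w] for w, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_subset(responses, ifd_scores, k, lam=0.5):
    """Greedily pick k samples maximizing IFD minus a redundancy
    penalty: max cosine similarity to responses already chosen."""
    vecs = tfidf_vectors(responses)
    chosen, remaining = [], list(range(len(responses)))
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda i: ifd_scores[i]
            - lam * max((cosine(vecs[i], vecs[j]) for j in chosen), default=0.0),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen

responses = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",                                  # duplicate answer
    "Use induction on n to prove the base and inductive cases.",
]
picked = select_subset(responses, ifd_scores=[0.9, 0.85, 0.5], k=2)
```

Here the second sample has a high IFD score but is an exact duplicate of the first response, so the redundancy penalty steers selection toward the lower-IFD but informative third sample: this is the point of measuring diversity on responses rather than instructions.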
Architecture
Figure 1(c)
Conceptual flowchart of the IterIT process compared to static selection. It shows the loop of 'Model Training' -> 'Metrics Update' -> 'Data Selection' repeating for each epoch.
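The per-epoch loop can be illustrated with a toy pool in which training shrinks a sample's difficulty, so the selected subset shifts between epochs. Every function and number below is a stand-in for the real training/scoring machinery, kept only to show why re-scoring changes the selection:

```python
def compute_scores(difficulty):
    # Metrics Update: re-score the whole candidate pool each epoch
    return dict(difficulty)

def select_top_k(scores, k):
    # Data Selection: here, simply the k currently-hardest samples
    return sorted(scores, key=scores.get, reverse=True)[:k]

def train_one_epoch(difficulty, subset):
    # Model Training: samples the model trains on become easier for it
    for s in subset:
        difficulty[s] *= 0.3

difficulty = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2}
history = []
for epoch in range(2):
    subset = select_top_k(compute_scores(difficulty), k=2)
    history.append(subset)
    train_one_epoch(difficulty, subset)
```

After one epoch, "a" and "b" have become easy, so the second selection swaps in "c" instead of re-training on them; a static ranking would have kept the initial subset for every epoch.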
Evaluation Highlights
  • Outperforms training on the full Alpaca dataset (Vanilla) using only 5% of the data, achieving +1.25% on MixEval-Hard
  • Surpasses strong baselines like Deita and GraphFilter on average across 7 standard benchmarks (GSM8K, MMLU, etc.) when training LLaMA-3-8B
  • Demonstrates superior generalization on code generation, beating Vanilla on HumanEval and MBPP+ when training on CodeAlpaca
Breakthrough Assessment
7/10
Strong empirical results showing dynamic data selection beats static methods and even full-dataset training. The iterative re-scoring idea is intuitive and effective, though the computational overhead of re-inference is a trade-off.