Evaluation Setup
Controlled fine-tuning on reasoning tasks followed by broad benchmarking.
Benchmarks:
- HumanEval (coding: Python generation)
- MATH (high-school competition mathematics)
- GSM-Plus (harder variant of GSM8K math reasoning)
- TheoremQA (STEM theorem application)
- LeetCode (interview-level programming)
Metrics:
- Pass@1 accuracy
- Statistical methodology: not explicitly reported in the paper
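Pass@1 is the standard functional-correctness metric for code and reasoning benchmarks. A minimal sketch of the unbiased pass@k estimator commonly used for it (the function name and exact formulation here follow the widely used estimator from the HumanEval literature, not anything specific to this paper):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n generations of
    which c are correct, passes the tests.

    For k = 1 this reduces to the fraction of correct samples, c / n.
    """
    if n - c < k:
        # Every possible draw of k samples contains a correct one.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with 10 generations of which 3 are correct, `pass_at_k(10, 3, 1)` evaluates to 0.3, i.e. plain per-sample accuracy.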
Key Results
GRAPE significantly outperforms standard baselines and stronger teacher models on aggregated benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across benchmarks | Accuracy gain | Not reported as a single aggregate number | Not reported as a single aggregate number | +13.8% |
| Average across benchmarks | Accuracy gain | Not reported as a single aggregate number | Not reported as a single aggregate number | +17.3% |
| Tulu3 benchmark suite | Average performance | Not reported as an exact number in text | Not reported as an exact number in text | +3.5% |
| General-domain instruction tuning | Average performance | Not reported as an exact number in text | Not reported as an exact number in text | +3.9% |
Main Takeaways
- Alignment with the base model's pre-trained distribution is a critical, often overlooked factor in SFT data selection.
- Data quantity has diminishing returns: selecting 'fitting' data outperforms simply scaling data volume by 3×.
- The 'strongest' teacher (e.g., 405B model) does not necessarily produce the best training data for smaller models; the gap can be significant.
- GRAPE is efficient: it requires only inference (forward passes) to select data, with no complex iterative training loops.
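The last two takeaways point at the core mechanic: scoring candidate responses with forward passes of the base model and keeping the ones it already assigns high likelihood. The paper does not spell out an implementation here, so the following is only a minimal sketch under that assumption, with hypothetical names; the per-response log-likelihoods are presumed to come from a single forward pass of the base model:

```python
from typing import Dict, List, Tuple

def select_by_base_likelihood(
    candidates: Dict[str, List[Tuple[str, float]]],
) -> Dict[str, str]:
    """For each prompt, keep the candidate response whose mean
    per-token log-likelihood under the BASE model is highest.

    `candidates` maps prompt -> list of (response, avg_logprob) pairs,
    where avg_logprob is assumed to be precomputed by running the
    frozen base model forward over the response (inference only,
    no gradient updates or iterative training loops).
    """
    return {
        prompt: max(responses, key=lambda r: r[1])[0]
        for prompt, responses in candidates.items()
    }
```

For instance, given two teacher responses for one prompt with average log-probs -2.0 and -1.0, the function keeps the -1.0 response, i.e. the one the base model finds more natural.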