_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
SAE: Sparse Autoencoder—a neural network trained to decompose dense model activations into sparse, interpretable features
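A minimal sketch of an SAE forward pass may help fix the idea: a linear encoder expands a dense activation into a wide, non-negative feature vector, and a linear decoder reconstructs the input from it. The sizes and random weights below are purely illustrative, not from any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # illustrative sizes; real SAEs are far wider than the model dimension

# Hypothetical untrained weights: encoder expands, decoder projects back
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)

def sae_forward(x):
    """Encode a dense activation into non-negative features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps feature activations non-negative
    x_hat = f @ W_dec                       # linear decoder reconstructs the input
    return f, x_hat

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
f, x_hat = sae_forward(x)
print(f"{int((f > 0).sum())} of {d_sae} features active")
```

Training adds a reconstruction loss plus a sparsity mechanism (e.g. an L1 penalty, or the TopK/JumpReLU activations defined below) so that only a few features fire per input.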
IFT: Instruction Fine-Tuning—the process of training pre-trained Large Language Models (LLMs) on datasets of instruction–response pairs so that they learn to follow user commands
Residual Stream: The primary vector pathway through a Transformer, which each attention and feed-forward layer reads from and adds its output back into
IFEval: Instruction Following Evaluation—a benchmark that measures a model's ability to follow verifiable constraints in instructions (e.g., 'no capitalization')
AlpacaEval 2.0: A benchmark using an LLM-based judge to compare model outputs against a reference model (usually GPT-4) on real-world user instructions
TopK Activation: An activation function that keeps only the K largest values in a vector and sets the rest to zero, enforcing sparsity
JumpReLU: An activation function that zeroes out values below a threshold and passes values above it linearly, used here to rectify SAE activations
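The two sparsity-enforcing activations above can be contrasted with a tiny numeric sketch (the vector and threshold are made-up values for illustration): TopK fixes the *number* of surviving entries, while JumpReLU fixes a *threshold* they must clear.

```python
import numpy as np

def topk(z, k):
    """TopK activation: keep the k largest entries of z, zero the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]  # indices of the k largest values
    out[idx] = z[idx]
    return out

def jumprelu(z, theta):
    """JumpReLU: zero values below threshold theta, pass values >= theta linearly."""
    return np.where(z >= theta, z, 0.0)

z = np.array([0.1, 2.0, -0.5, 1.2, 0.7])
print(topk(z, 2))        # only the two largest entries survive: 2.0 and 1.2
print(jumprelu(z, 1.0))  # only entries >= 1.0 survive: 2.0 and 1.2
```

On this input the two agree, but in general TopK guarantees exactly k active features per input, whereas JumpReLU lets the number of active features vary with the input (in trained SAEs the threshold is typically learned per feature).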
Monosemanticity: The property of a neuron or feature responding to exactly one specific concept (e.g., a specific syntax or topic) rather than multiple unrelated concepts