
FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain

Rohan Deb, K. Thekumparampil, Kousha Kalantari, G. Hiranandani, Shoham Sabach, B. Kveton
International Conference on Machine Learning (2025)
Tags: Pretraining, Factuality

📝 Paper Summary

Tags: Data Selection for Fine-tuning, Optimal Experiment Design
FisherSFT improves the data efficiency of supervised fine-tuning by selecting the most informative training examples that maximize the determinant of the model's approximated Fisher Information Matrix.
Core Problem
Fine-tuning Large Language Models (LLMs) on large datasets is computationally expensive, and standard selection methods (random sampling, coverage, quality filtering) ignore how much statistical information each example actually carries about the model's parameters.
Why it matters:
  • The computational cost of fine-tuning is linear in the number of training examples, creating a need for methods that maintain performance with smaller datasets
  • Existing coverage-based or quality-based sampling methods optimize for dataset properties (like diversity) rather than the model's actual learning objective (maximizing likelihood)
  • Standard approaches treat sentences as single data points, ignoring the joint information value of the sequence of tokens within them
Concrete Example: In standard selection, a dataset might contain many sentences with similar, redundant embeddings that contribute little to parameter updates. FisherSFT analyzes the pre-logit embeddings and rejects these redundant examples in favor of ones whose diverse embeddings expand the volume spanned by the design, increasing the determinant of the approximated Fisher Information Matrix and thereby the information gain.
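The intuition above can be sketched numerically. This is a minimal illustration with synthetic embeddings (not the paper's implementation): the marginal log-determinant gain of adding an embedding to a rank-one-updated design matrix, computed via the matrix determinant lemma. A near-duplicate of an already-selected embedding yields a much smaller gain than an embedding pointing in a new direction.

```python
import numpy as np

def logdet_gain(design, x, reg=1e-3):
    """Log-det increase from adding embedding x to the design matrix.

    By the matrix determinant lemma:
      logdet(A + x x^T) - logdet(A) = log(1 + x^T A^{-1} x),
    where A is the regularized design matrix.
    """
    d = design.shape[0]
    A = design + reg * np.eye(d)
    return np.log1p(x @ np.linalg.solve(A, x))

# Synthetic 8-dimensional "pre-logit embeddings" (hypothetical data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=8)
design = np.outer(x1, x1)      # design already covers x1's direction
redundant = 0.9 * x1           # near-duplicate embedding: tiny gain
diverse = rng.normal(size=8)   # new direction: large gain

assert logdet_gain(design, diverse) > logdet_gain(design, redundant)
```

The redundant example adds almost no volume to the design, so its marginal gain is small; the diverse one opens a new direction and its gain is orders of magnitude larger.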
Key Novelty
Fisher Information-based Data Selection (FisherSFT)
  • Models the selection of fine-tuning data as an Optimal Design problem for multinomial logistic regression, using the LLM's pre-logit layer as feature vectors
  • Approximates the computationally intractable Hessian of the log-likelihood (Fisher Information) using a tensor product of pre-logit embeddings, reducing complexity from vocab-size dependency to embedding-size dependency
  • Uses a greedy algorithm with lazy evaluations (exploiting submodularity) to efficiently select the subset of sentences that maximizes the log-determinant of this approximated Hessian
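The greedy selection with lazy evaluations can be sketched as follows. This is a simplified illustration, not the paper's code: each sentence is reduced to a single embedding vector, the objective is the log-determinant of a regularized design matrix, and the lazy trick reuses stale marginal gains from a max-heap, recomputing only when a stale entry reaches the top (valid because the log-det gain is submodular, i.e. gains only shrink as the set grows).

```python
import heapq
import numpy as np

def greedy_logdet_select(X, k, reg=1e-3):
    """Greedily pick k rows of X (n x d) maximizing logdet of the
    regularized design matrix, with lazy gain re-evaluation."""
    n, d = X.shape
    A_inv = np.eye(d) / reg  # inverse of the initial design, reg * I
    # Marginal gain of x given current design: log(1 + x^T A^{-1} x).
    heap = [(-np.log1p(x @ A_inv @ x), i, 0) for i, x in enumerate(X)]
    heapq.heapify(heap)
    selected, step = [], 0
    while len(selected) < k and heap:
        neg_gain, i, stamp = heapq.heappop(heap)
        if stamp == step:
            # Gain is up to date: accept, then update A^{-1} by the
            # Sherman-Morrison rank-one formula.
            selected.append(i)
            u = A_inv @ X[i]
            A_inv -= np.outer(u, u) / (1.0 + X[i] @ u)
            step += 1
        else:
            # Stale gain: recompute against the current design, push back.
            heapq.heappush(heap, (-np.log1p(X[i] @ A_inv @ X[i]), i, step))
    return selected

# Toy check: two near-duplicate directions plus one orthogonal direction.
X = np.array([[1.0, 0.0, 0.0],
              [0.99, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
picks = greedy_logdet_select(X, k=2)
```

On this toy input the second pick is the orthogonal row (index 2) rather than the near-duplicate, since the duplicate's marginal gain collapses after the first selection.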
Breakthrough Assessment
7/10
Applies classical optimal design theory to modern LLMs effectively. The reduction of the Hessian complexity to make Fisher information tractable for LLM vocabulary sizes is a significant methodological contribution.