
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, Wenlong Zhao, Seungwook Han, Abhishek Bhandwaldar, Guangxuan Xu, Kai Xu, Ligong Han, Luke Inglis, Akash Srivastava
MIT-IBM Watson AI Lab, IBM Research, Massachusetts Institute of Technology, University of Massachusetts Amherst
arXiv (2024)

📝 Paper Summary

Supervised Fine-Tuning (SFT) of Small Language Models (SLMs)
Contrary to common practice such as the TULU recipe, fine-tuning small LLMs benefits significantly from larger batch sizes paired with lower learning rates, and simpler stacked training matches more complex phased strategies.
Core Problem
Practitioners with limited resources lack clear guidance on optimal hyperparameters and strategies for fine-tuning small LLMs (3B-7B), often relying on practices from larger models or incomplete reports.
Why it matters:
  • Small organizations and individual developers cannot afford the extensive grid searches required to find optimal settings
  • Commonly accepted 'gold standard' configurations (e.g., TULU's small batch sizes) may be sub-optimal for generalization
  • Detailed technical reports on failed experiments and critical factors like batch size are rarely published by open-source initiatives
Concrete Example: The widely cited TULU framework uses a batch size of 128. This paper finds that increasing the batch size to ~3,840 or ~7,680 significantly improves downstream benchmark scores for small models, suggesting the standard practice leaves performance on the table.
Key Novelty
Systematic Large-Batch SFT Recipe for Small Models
  • Demonstrates that large batch sizes (roughly 4k-8k) coupled with lower learning rates yield better generalization than standard small-batch approaches
  • Identifies early-stage training metrics (gradient norms, loss values) that reliably predict final performance, allowing bad runs to be pruned early
  • Shows that 'stacked' training (mixing all data at once) matches complex 'phased' training (curriculum learning) while being more sample-efficient
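The large-batch recipe above is usually realized through gradient accumulation on memory-limited hardware. A minimal sketch, assuming hypothetical micro-batch and device counts (the helper name and signature are illustrative, not from the paper's code):

```python
def accumulation_steps(target_batch: int, per_device_batch: int, num_devices: int = 1) -> int:
    """Smallest number of gradient-accumulation steps whose effective
    batch size meets or exceeds target_batch."""
    per_step = per_device_batch * num_devices
    return -(-target_batch // per_step)  # ceiling division

# Example: 8 GPUs with a micro-batch of 4 per GPU, targeting the
# paper's ~3,840 effective batch size (assumed hardware setup).
steps = accumulation_steps(3840, per_device_batch=4, num_devices=8)
effective = steps * 4 * 8  # 120 steps -> effective batch of 3,840
```

Doubling the target to ~7,680 simply doubles the accumulation steps; the optimizer then takes one update per effective batch at the recipe's lower learning rate.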
Evaluation Highlights
  • Large batch sizes (up to 7,680) combined with lower learning rates consistently improve performance on MMLU and MT-Bench compared to the standard small batch size of 128
  • Stacked training achieves performance similar to phased training while being more sample-efficient, simplifying the pipeline
  • Omitting warmup steps and using constant learning rates does not compromise performance, challenging the necessity of complex schedules like cosine decay
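The early-pruning idea in the highlights above can be sketched as a simple monitor. This is an assumption-laden illustration (the function name, thresholds, and window size are invented for the example, not taken from the paper), showing how early gradient norms and loss values could be used to stop clearly bad runs:

```python
def should_prune(grad_norms, losses, max_grad_norm=10.0, max_loss=3.0, window=50):
    """Flag a run for early stopping if its recent gradient norms have
    blown up or its recent losses have failed to drop below a ceiling.
    Thresholds here are placeholders; in practice they would be
    calibrated against a few known-good reference runs."""
    recent_norms = grad_norms[-window:]
    recent_losses = losses[-window:]
    norm_blowup = sum(recent_norms) / len(recent_norms) > max_grad_norm
    loss_stall = min(recent_losses) > max_loss
    return norm_blowup or loss_stall

# A healthy run (small gradients, falling loss) is kept;
# a run with exploding gradients is pruned early.
keep = should_prune([0.5] * 60, [1.2] * 60)    # False
prune = should_prune([50.0] * 60, [1.2] * 60)  # True
```

The payoff is compute savings: runs that look pathological in their first few hundred steps are terminated instead of being trained to completion.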
Breakthrough Assessment
7/10
Provides highly practical, empirically grounded insights that contradict established norms (TULU). While not a new architecture, it offers a rigorous 'recipe' that democratizes effective fine-tuning for resource-constrained practitioners.