SFT: Supervised Fine-Tuning—adapting a pre-trained model to a specific task using labeled examples
Phased Training: A curriculum learning strategy in which the model is trained in sequential stages, each stage drawing on data of increasing difficulty or from a different domain
Stacked Training: A simpler strategy where all datasets from different domains or difficulty levels are combined into a single training mix
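The difference between the two strategies reduces to how the training order is constructed from the per-stage datasets. A minimal sketch, using hypothetical helper names (`phased_schedule`, `stacked_schedule`) and toy string examples rather than the actual data mixes:

```python
import random

def phased_schedule(stages):
    """Phased training: visit stages sequentially, shuffling only
    within each stage, so the stage ordering (the curriculum) is
    preserved across the run."""
    order = []
    for stage in stages:
        block = list(stage)
        random.shuffle(block)   # shuffle inside the stage only
        order.extend(block)
    return order

def stacked_schedule(stages):
    """Stacked training: pool every stage into one mix and shuffle
    globally, so domains and difficulty levels are interleaved."""
    pool = [example for stage in stages for example in stage]
    random.shuffle(pool)
    return pool
```

With `stages = [skills_data, knowledge_data]`, the phased schedule finishes all of `skills_data` before any of `knowledge_data`, while the stacked schedule mixes the two throughout.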
Effective Batch Size: The total number of samples processed before a model weight update, often achieved by accumulating gradients across multiple smaller micro-batches
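Gradient accumulation can be sketched on a toy scalar model; the function name `sgd_with_accumulation` and the single-weight setup are illustrative, not the training stack used in the paper:

```python
def sgd_with_accumulation(micro_batches, accum_steps, lr=0.1):
    """Accumulate gradients over `accum_steps` micro-batches before
    each weight update, so the effective batch size is
    accum_steps * (samples per micro-batch)."""
    w = 0.0            # one-parameter model: y_hat = w * x
    grad_accum = 0.0
    updates = 0
    for step, batch in enumerate(micro_batches, start=1):
        # gradient of MSE loss 0.5*(w*x - y)^2, averaged over the micro-batch
        g = sum((w * x - y) * x for x, y in batch) / len(batch)
        grad_accum += g
        if step % accum_steps == 0:
            # average the accumulated gradient, then apply one update
            w -= lr * (grad_accum / accum_steps)
            grad_accum = 0.0
            updates += 1
    return w, updates
```

For example, 8 micro-batches of 4 samples with `accum_steps=4` yield 2 weight updates, each computed from an effective batch of 16 samples.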
Gradient Norm: The magnitude of the gradient vector during training; used here as an indicator of training stability and potential final performance
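The quantity monitored is the global L2 norm over all gradient entries, flattened across parameter groups. A minimal sketch (the helper name `global_grad_norm` is an assumption):

```python
import math

def global_grad_norm(grads):
    """Global L2 norm of the gradient: square every entry across all
    parameter groups, sum, and take the square root. Spikes in this
    value between steps are a common sign of training instability."""
    return math.sqrt(sum(g * g for group in grads for g in group))
```

For instance, gradients `[3.0]` and `[4.0]` in two parameter groups give a global norm of 5.0, the same value gradient-clipping utilities compute before rescaling.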
TULU: A reference configuration and dataset from Wang et al. (2023b), often considered a baseline for open-source instruction tuning
LAB: Large-scale Alignment for chatBots—a method and dataset focusing on knowledge and skills data, used as a primary configuration baseline here
MTBench: A benchmark for evaluating the conversational and instruction-following capabilities of LLMs using multi-turn questions
MMLU: Massive Multitask Language Understanding—a benchmark measuring a model's knowledge across 57 diverse subjects