MDM: Masked Diffusion Model—a generative model that adds noise by masking tokens and learns to reconstruct them
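The forward noising process named in this definition can be sketched in a few lines of Python; the function name `mask_tokens`, the sentinel `mask_id`, and the per-token independent masking are illustrative assumptions, not the paper's implementation:

```python
import random

def mask_tokens(tokens, mask_ratio, mask_id=-1):
    """Illustrative forward step of a masked diffusion model: each token is
    independently replaced by a mask token with probability mask_ratio; the
    model is then trained to reconstruct the masked originals."""
    noisy, targets = [], []
    for tok in tokens:
        if random.random() < mask_ratio:
            noisy.append(mask_id)  # corrupted position
            targets.append(tok)    # reconstruction target at this position
        else:
            noisy.append(tok)
            targets.append(None)   # unmasked positions carry no loss
    return noisy, targets

noisy, targets = mask_tokens([5, 9, 2, 7], mask_ratio=0.5)
```

Varying `mask_ratio` over training (from near 0 to near 1) is what gives the model a diffusion-style noise schedule rather than a fixed BERT-style masking rate.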
ARM: Autoregressive Model—standard language models that generate text one token at a time from left to right
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs
Reversal Curse: The inability of autoregressive LLMs to generalize from 'A is B' to 'B is A', or to generate text in reverse order, a consequence of their unidirectional left-to-right training
FLOPs: Floating Point Operations—a measure of computational cost used here to analyze scaling laws
MMLU: Massive Multitask Language Understanding—a benchmark evaluating models on a wide range of subjects
GSM8K: Grade School Math 8K—a benchmark of roughly 8,500 high-quality grade school math word problems
In-context learning: The ability of a model to perform tasks based on examples provided in the prompt without parameter updates
KL divergence: Kullback-Leibler divergence—an asymmetric measure of how one probability distribution differs from another, used in the loss function to align the model distribution with the data distribution
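For discrete distributions, the quantity defined above is D_KL(p‖q) = Σᵢ pᵢ log(pᵢ/qᵢ). A minimal sketch (not tied to the paper's loss implementation):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists.
    Terms with p_i = 0 contribute nothing by the convention 0*log(0) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Zero when the distributions match; positive and asymmetric otherwise.
kl_divergence([0.5, 0.5], [0.5, 0.5])  # 0.0
```

Note the asymmetry: `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`, which is why KL is a divergence rather than a true distance.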