BPE: Byte Pair Encoding—a standard subword tokenization algorithm that iteratively merges the most frequent pair of adjacent bytes/characters to build a fixed vocabulary
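The merge loop behind BPE can be sketched in a few lines. This is an illustrative toy, not a production tokenizer (real implementations handle byte fallback, pre-tokenization, and vocabulary serialization); the corpus and merge count are made up:

```python
# Toy BPE: repeatedly find the most frequent adjacent symbol pair
# across the corpus and merge it into a single new symbol.
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = bpe_merges(["low", "lower", "lowest", "low"], num_merges=3)
# First merge is ('l', 'o'), since "lo" occurs in every word.
```

Each merge adds one entry to the vocabulary, so the vocabulary size is directly controlled by the number of merges.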
Subword tokenization: Splitting text into units larger than characters but smaller than words (e.g., 'ing', 'pre') to balance vocabulary size and sequence length
KV caching: Key-Value caching—storing the key and value projections of previously processed tokens so that each new autoregressive decoding step only computes attention inputs for the newest token, speeding up generation during inference
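A minimal single-head sketch of the idea in NumPy (shapes, weights, and the four-step loop are all illustrative assumptions): per step, only the new token's key and value are projected and appended to a growing cache, rather than recomputing them for the whole prefix:

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(4):
    x = rng.normal(size=d)            # current token's hidden state
    q, k, v = x @ Wq, x @ Wk, x @ Wv  # project only the new token
    K_cache = np.vstack([K_cache, k]) # append to cache instead of
    V_cache = np.vstack([V_cache, v]) # recomputing all past K/V
    out = attend(q, K_cache, V_cache)
```

Without the cache, step t would redo t key/value projections; with it, each step does constant work per layer (plus the attention itself over the cached prefix).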
FLOPs: Floating Point Operations—a measure of computational cost
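A common back-of-envelope convention counts one multiply and one add per term of a matrix product, giving 2·m·k·n FLOPs for an (m × k) @ (k × n) multiply. A tiny sketch (the sequence length and hidden size below are illustrative, not taken from this document):

```python
def matmul_flops(m, k, n):
    # One multiply + one add per accumulated term: 2*m*k*n.
    return 2 * m * k * n

# e.g. a feed-forward up-projection from d to 4d for 1024 tokens
# with d = 4096 (made-up example numbers):
flops = matmul_flops(1024, 4096, 4 * 4096)  # ≈ 1.37e11 FLOPs
```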
Backbone: The central, largest part of the transformer model that processes word-level embeddings
Autoregressive loop: A generation process where the model predicts one element at a time, feeding the prediction back as input for the next step
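The feed-back pattern of the autoregressive loop can be shown with a stand-in "model" (the next-token function below is a hypothetical placeholder, not a real predictor):

```python
def fake_next_token(context):
    # Hypothetical stand-in for a model: last token + 1, mod 10.
    return (context[-1] + 1) % 10

def generate(prompt, steps):
    seq = list(prompt)
    for _ in range(steps):
        nxt = fake_next_token(seq)  # predict one element...
        seq.append(nxt)             # ...feed it back as input
    return seq

out = generate([3], steps=4)  # → [3, 4, 5, 6, 7]
```

Because each step conditions on all previous outputs, generation is inherently sequential—which is exactly why optimizations like KV caching matter.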
Llama: A family of open-source large language models developed by Meta, used here as the architectural baseline
Perplexity: A measure of how well a probability model predicts a sample, computed as the exponential of the average negative log-likelihood per token; lower values indicate better performance
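Perplexity is the exponential of the mean negative log-likelihood over tokens. A minimal sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    # exp of average negative log-likelihood (cross-entropy in nats).
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4,
# as if it were choosing uniformly among 4 equally likely options:
pp = perplexity([0.25, 0.25, 0.25])  # → 4.0
```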
Gradient accumulation: A technique to simulate larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights
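A framework-free sketch of gradient accumulation (the scalar least-squares "model" and micro-batch split below are purely illustrative): gradients from several micro-batches are averaged before a single weight update, matching the gradient of one large batch:

```python
import numpy as np

def grad(w, x, y):
    # d/dw of 0.5 * mean((w*x - y)^2) for a scalar weight w.
    return np.mean((w * x - y) * x)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
y = 3.0 * x
# Split one batch of 16 into 4 equal micro-batches.
micro_batches = [(x[i:i + 4], y[i:i + 4]) for i in range(0, 16, 4)]

w, lr = 0.0, 0.1
accum = 0.0
for xb, yb in micro_batches:
    accum += grad(w, xb, yb) / len(micro_batches)  # accumulate only
w -= lr * accum                                    # one update for all 4
```

With equal-sized micro-batches, the accumulated gradient equals the full-batch gradient, so the update is identical to training with the larger batch—only peak memory changes.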