Zero-shot: Evaluating a model on a task it was not explicitly trained for, without any gradient updates or fine-tuning on that task's training data.
Perplexity (PPL): A measurement of how well a probability model predicts a sample; lower is better. It is the exponentiated average negative log-likelihood per token.
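The definition above can be sketched directly in code. This is a minimal illustration, not any particular library's API; the input format (a list of natural-log probabilities the model assigned to the observed tokens) is an assumption for the example.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token).

    token_logprobs: natural-log probabilities assigned to each observed
    token (hypothetical input format for this sketch).
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 1/4 to every token has perplexity 4:
# it is "as confused" as a uniform choice among 4 options.
ppl = perplexity([math.log(0.25)] * 10)
```

A perfect model (probability 1 on every token) gives an average NLL of 0 and hence a perplexity of exactly 1, the lower bound.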
WebText: A dataset of roughly 8 million documents (about 40 GB of text) created by scraping the 45 million outbound links from Reddit posts that received at least 3 karma, emphasizing human-curated quality over raw scale.
BPE (Byte Pair Encoding): A tokenization method that iteratively merges the most frequent pairs of characters (or bytes) to form a vocabulary that interpolates between character-level and word-level representations.
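The merge loop at the heart of BPE can be sketched on a toy corpus. The corpus, its word frequencies, and the two-merge loop are all illustrative assumptions; a real tokenizer would also record the merge order to apply it to new text.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus (hypothetical): word -> frequency, words as tuples of characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(2):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
```

After two merges the frequent substring "low" has become a single vocabulary symbol, while rarer suffixes like "e", "r", "s", "t" remain character-level.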
Transformer: A neural network architecture relying entirely on self-attention mechanisms to draw global dependencies between input and output.
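The self-attention mechanism referred to above is, at its core, scaled dot-product attention: softmax(QK^T / sqrt(d)) V. A pure-Python sketch (single head, no learned projections, plain lists instead of tensors, all simplifying assumptions):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: each query attends over all keys,
    producing a weighted average of the values."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because every query scores every key directly, any position can draw on any other position in one step; that is the "global dependencies" property the definition names.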
Layer Normalization: A technique to normalize the inputs across the features for each training example, stabilizing the learning process.
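The normalization itself is a one-liner per example: subtract the feature mean, divide by the feature standard deviation. This sketch omits the learnable gain and bias parameters that the full technique applies after normalizing; the `eps` term guards against division by zero.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one example's features to zero mean and unit variance.
    Learnable gain/bias (present in the full technique) are omitted here."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]
```

Unlike batch normalization, the statistics are computed per example across its features, so the operation behaves identically at any batch size, including 1.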
Greedy decoding: A generation strategy where the model selects the highest probability token at each step.
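Greedy decoding is straightforward to sketch. The `step_fn` interface (a callable returning a token-to-probability dict for the current prefix), the `<s>`/`<eos>` markers, and the toy transition table are all hypothetical, chosen only to make the example self-contained.

```python
def greedy_decode(step_fn, start, max_len):
    """Deterministic decoding: at each step append the argmax token,
    stopping at <eos> or after max_len steps."""
    seq = list(start)
    for _ in range(max_len):
        probs = step_fn(seq)               # token -> probability (hypothetical interface)
        token = max(probs, key=probs.get)  # greedy choice: highest-probability token
        if token == "<eos>":
            break
        seq.append(token)
    return seq

# Toy "model": next-token distribution depends only on the last token.
table = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"b": 0.7, "<eos>": 0.3},
    "b": {"<eos>": 0.9, "a": 0.1},
}
out = greedy_decode(lambda s: table[s[-1]], ["<s>"], 5)  # → ["<s>", "a", "b"]
```

Because ties aside the argmax is unique, greedy decoding always produces the same output for the same prefix; sampling-based strategies trade this determinism for diversity.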