ICL: In-Context Learning—the ability of a model to solve new tasks based solely on the prompt context without parameter updates
Zero-shot: Evaluating a model on a task without providing any examples of that task in the prompt
SlimPajama: A large-scale, deduplicated, and cleaned open-source dataset for training large language models
AO-Childes: A corpus of transcripts of child-directed speech; its vocabulary is used here to define 'simple' language
Perplexity: A measure of how well a probability model predicts a sample, equal to the exponentiated average negative log-likelihood per token; lower values indicate better prediction
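The perplexity definition above can be sketched directly from per-token probabilities; the function name and inputs here are illustrative, not from the source:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood
    of the probabilities a model assigned to each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.5 to every token is, on average,
# as uncertain as a fair coin flip per token: perplexity 2.
print(perplexity([0.5, 0.5, 0.5, 0.5]))  # -> 2.0
```

In practice the per-token log-probabilities come from the language model's softmax output, but the aggregation is exactly this.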
Emergent abilities: Capabilities (like reasoning or ICL) that appear suddenly only after models reach a certain scale (parameters/compute)
Zipfian Coefficient: The slope of the log-log rank-frequency plot of word counts; a value near -1 matches the distribution observed in natural language
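A minimal sketch of estimating this coefficient as the least-squares slope of log(frequency) against log(rank); the function name is illustrative and a real analysis would likely use a proper power-law fit:

```python
import math
from collections import Counter

def zipf_coefficient(tokens):
    """Least-squares slope of log(frequency) vs. log(rank).
    Natural-language corpora typically yield a slope near -1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var
```

On a synthetic corpus where the r-th most common word appears about 1000/r times, this returns a value close to -1.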
RoPE: Rotary Positional Embeddings—a method for encoding position information in Transformers that generalizes well to varying sequence lengths
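The core idea of RoPE can be shown in a few lines: each consecutive pair of embedding dimensions is rotated by an angle proportional to the token's position, so that attention dot products depend only on the *relative* distance between positions. This is a plain-Python sketch (real implementations operate on tensors and apply it to queries and keys inside attention):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions of `vec` by an
    angle pos * base**(-i/d), as in rotary positional embeddings."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out
```

The key property: the dot product of `rope(q, m)` and `rope(k, n)` depends only on the offset n - m, which is why RoPE generalizes across sequence lengths.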
Flash Attention: An IO-aware exact attention algorithm that speeds up training and reduces memory usage
BPE: Byte Pair Encoding—a tokenization method that builds a subword vocabulary by iteratively merging the most frequent adjacent pair of symbols, starting from characters or bytes
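The merge loop at the heart of BPE training can be sketched as follows; this is a toy illustration on whole words (function name and corpus are made up), whereas production tokenizers work on bytes and handle word boundaries explicitly:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merges: repeatedly find the most frequent adjacent
    symbol pair across the corpus and fuse it into one symbol."""
    vocab = Counter(tuple(w) for w in words)  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_train(["low", "lower", "lowest"] * 5, num_merges=2)
```

Here the first two learned merges fuse 'l'+'o' and then 'lo'+'w', reflecting that "low" is the most frequent shared prefix in the toy corpus.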