OPT: Open Pre-trained Transformer—the suite of models released in this paper
FSDP: Fully Sharded Data Parallel—a memory-efficiency technique that shards model parameters, gradients, and optimizer states across GPUs
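A minimal pure-Python sketch of the sharding idea behind FSDP (the helper names `shard` and `all_gather` are hypothetical, not the PyTorch API): each of N workers stores only 1/N of the flat parameter vector, and the full vector is reconstructed (all-gathered) only when it is needed.

```python
def shard(params, n_workers):
    """Split a flat parameter list into n_workers contiguous shards."""
    k = -(-len(params) // n_workers)  # ceiling division
    return [params[i * k:(i + 1) * k] for i in range(n_workers)]

def all_gather(shards):
    """Reassemble the full parameter vector from the per-worker shards."""
    return [p for s in shards for p in s]

params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
shards = shard(params, 3)            # each worker holds 2 of the 6 values
assert all_gather(shards) == params  # gathering recovers the full vector
```

In real FSDP the same sharding applies to gradients and optimizer state, which is where most of the memory savings come from.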
Tensor Parallelism: Splitting individual tensor operations across multiple GPUs (e.g., Megatron-LM style) to fit large layers in memory
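A sketch of column-parallel tensor parallelism in the Megatron-LM style, for a linear layer y = x @ W (all helpers here are hypothetical illustrations): W's columns are split across devices, each device computes its slice of y, and the slices are concatenated.

```python
def matmul(x, W):
    """Product of row vector x with matrix W, both as plain Python lists."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def split_columns(W, n_devices):
    """Give each device a contiguous block of W's columns."""
    k = len(W[0]) // n_devices
    return [[row[d * k:(d + 1) * k] for row in W] for d in range(n_devices)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
partials = [matmul(x, Wd) for Wd in split_columns(W, 2)]  # one slice per device
y = [v for part in partials for v in part]                # concatenate slices
assert y == matmul(x, W)  # parallel result matches the unsplit computation
```

No cross-device communication is needed until the slices are concatenated, which is why this split fits large layers in memory cheaply.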
Dynamic Loss Scaling: Adjusting the scaling factor for loss values during mixed-precision training to prevent underflow of small gradients
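A hypothetical sketch of the dynamic loss scaling loop (the class and its parameters are illustrative, not a specific library's API): gradients are computed on a scaled loss; on overflow the step is skipped and the scale is halved, and after a run of stable steps the scale is grown again.

```python
import math

class LossScaler:
    def __init__(self, scale=2.0**15, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled grads, or None if the step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= 2          # overflow: shrink scale, skip update
            self.good_steps = 0
            return None
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2          # stable for a while: try a larger scale
        return [g / self.scale for g in scaled_grads]

scaler = LossScaler(scale=1024.0)
assert scaler.step([float("inf")]) is None  # overflow step is dropped
assert scaler.scale == 512.0                # and the scale is halved
assert scaler.step([512.0]) == [1.0]        # normal step: grads unscaled
```

Scaling the loss up before backprop keeps small fp16 gradients above the underflow threshold; dividing the gradients back down before the optimizer step keeps the update mathematically unchanged.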
Zero-shot learning: Evaluating a model on a task without providing any examples in the prompt
Few-shot learning: Evaluating a model by providing a few examples (demonstrations) in the prompt context
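The zero-shot/few-shot distinction is purely a matter of prompt construction. A hypothetical helper for a sentiment task (the format and labels are illustrative): zero-shot gives only the query, few-shot prepends labeled demonstrations.

```python
def build_prompt(query, demos=()):
    """Format an optional list of (text, label) demos followed by the query."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

zero_shot = build_prompt("A delightful film.")
few_shot = build_prompt(
    "A delightful film.",
    demos=[("Loved every minute.", "positive"),
           ("A tedious mess.", "negative")],
)
assert zero_shot.startswith("Review: A delightful film.")
assert few_shot.count("Sentiment:") == 3  # 2 demonstrations + the query
```

The model's weights are identical in both settings; only the context it conditions on changes.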
Pile: The Pile—a large-scale, diverse, open-source dataset for language model training
MinHashLSH: MinHash Locality Sensitive Hashing—an algorithm used for detecting and removing near-duplicate documents in the dataset
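A simplified MinHash sketch for near-duplicate detection (assumptions: word shingles and Python's built-in `hash` as the hash family; a production system bands these signatures into LSH buckets to avoid all-pairs comparison):

```python
def shingles(text, k=3):
    """The set of k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(sh, num_hashes=64):
    """Signature: for each seed, the minimum hash over the shingle set."""
    return [min(hash((seed, s)) for s in sh) for seed in range(num_hashes)]

def similarity(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumps over the lazy cat"))
c = minhash(shingles("an entirely different sentence about model training"))
assert similarity(a, b) > similarity(a, c)  # near-duplicates score higher
```

The key property is that the probability two signatures agree in a given slot equals the Jaccard similarity of the underlying shingle sets, so short fixed-size signatures stand in for full document comparison.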
Megatron-LM: A highly optimized library for training large transformer models on NVIDIA GPUs
AdamW: A variant of the Adam optimizer that decouples weight decay from the gradient update