LLM360: An initiative to provide fully open-source LLMs with complete transparency, including training code, data, logs, and checkpoints.
Loss Spike: A sudden increase in the loss function during training, often indicating instability or divergence; K2 categorizes these as 'benign' (recoverable) or 'malignant' (requiring rollback).
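The benign/malignant distinction can be illustrated with a toy detector. This is a hedged sketch, not K2's actual criterion: the function name `classify_spikes`, the z-score threshold, and the recovery window are all illustrative assumptions.

```python
import statistics

def classify_spikes(losses, window=20, threshold=3.0, recover_steps=50):
    """Toy spike heuristic (illustrative, not K2's actual rule): flag a
    step whose loss exceeds the recent moving average by `threshold`
    standard deviations; call it 'benign' if the loss returns to the
    pre-spike mean within `recover_steps`, else 'malignant' (a candidate
    for rollback to an earlier checkpoint)."""
    spikes = []
    for t in range(window, len(losses)):
        recent = losses[t - window:t]
        mu = statistics.mean(recent)
        sigma = statistics.stdev(recent)
        if sigma > 0 and losses[t] > mu + threshold * sigma:
            # Benign if the loss drops back to the pre-spike mean soon after.
            tail = losses[t + 1:t + 1 + recover_steps]
            kind = "benign" if any(l <= mu for l in tail) else "malignant"
            spikes.append((t, kind))
    return spikes
```

In practice, spike handling at K2's scale also involves inspecting gradient norms and data batches, not just the loss curve itself.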
TxT360: The fully open data curation pipeline and dataset developed by the authors, ensuring high-quality pretraining data.
RoPE: Rotary Positional Embeddings—a method for encoding position information in Transformers that generalizes well to longer sequences.
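The rotation can be sketched in a few lines of NumPy. This follows the "rotate half" convention (dimension i paired with dimension i + dim/2); implementations also exist with interleaved pairs, and the `base` default of 10000 is the value from the original formulation.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Dimension i is paired with dimension i + dim/2, and each pair is
    rotated by an angle that grows with position, so relative offsets
    are encoded directly in query-key dot products."""
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation rate per dimension pair, decaying geometrically.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1, x2) pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

The key property is relativity: for fixed query and key vectors, the dot product between position i of the rotated query and position j of the rotated key depends only on the offset j - i.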
FlashAttention-2: An exact, IO-aware attention algorithm that speeds up training and reduces memory usage by never materializing the full attention matrix.
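The core idea, online softmax over tiles, can be shown for a single query vector. This is a minimal NumPy sketch of the math only; the actual FlashAttention-2 kernel also tiles queries, runs on GPU, and fuses all steps into one kernel.

```python
import numpy as np

def tiled_attention_row(q, K, V, block=4):
    """Online-softmax attention for one query vector q (shape (d,)):
    process K and V in blocks, maintaining a running max and normalizer,
    so the full seq_len-sized score vector is never stored at once."""
    d = q.shape[0]
    m = -np.inf            # running max of scores seen so far
    l = 0.0                # running softmax normalizer
    acc = np.zeros(d)      # running weighted sum of value rows
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q / np.sqrt(d)   # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)             # rescale old accumulator
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / l
```

Because the rescaling is exact, the result matches dense softmax attention to floating-point precision, which is why FlashAttention is "exact" rather than an approximation.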
AdamW: A stochastic optimization method that modifies the typical Adam implementation of weight decay to decouple it from the gradient update.
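The decoupling can be made concrete with a single-step sketch. Hyperparameter defaults here are common conventions, not K2's training configuration.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update. Weight decay is applied directly to the
    parameter, decoupled from the adaptive moment estimates; in plain
    Adam with L2 regularization, the decay term would instead be folded
    into grad and then rescaled by the adaptive denominator."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: the lr * wd * param term bypasses m and v entirely.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps)
                          + weight_decay * param)
    return param, m, v
```

With a zero gradient, the parameter still shrinks by exactly lr * weight_decay * param per step, which is the behavior that distinguishes AdamW from L2-regularized Adam.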
CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer.
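A one-shot CoT prompt might look like the following. The wording is purely illustrative and not drawn from K2's evaluation suite.

```python
# A minimal chain-of-thought prompt: the exemplar shows worked reasoning
# before the final answer, nudging the model to do the same for the new
# question rather than emitting an answer directly.
prompt = (
    "Q: A farmer has 3 pens with 4 sheep each. How many sheep in total?\n"
    "A: Each pen holds 4 sheep and there are 3 pens, so 3 * 4 = 12. "
    "The answer is 12.\n"
    "Q: A shelf holds 5 rows of 6 books. How many books in total?\n"
    "A:"  # the model is expected to reason step by step, then answer
)
```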
FLOPs: Floating Point Operations—a measure of compute cost; reducing FLOPs means the model is more efficient to train or run.
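For dense Transformers, training compute is commonly estimated with the 6ND rule of thumb; this is an approximation (it ignores attention FLOPs at long context, for example), not an exact accounting.

```python
def train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute for a dense Transformer:
    ~6 FLOPs per parameter per token (roughly 2 for the forward pass
    and 4 for the backward pass)."""
    return 6 * n_params * n_tokens
```

For example, a 65B-parameter model trained on 1.4T tokens would need roughly 6 * 65e9 * 1.4e12 = 5.5e23 FLOPs under this estimate.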
SFT: Supervised Fine-Tuning—the process of training a pre-trained base model on labeled instruction-response pairs.
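A standard detail of SFT is loss masking: the cross-entropy loss is computed only on response tokens, not on the instruction. A minimal sketch, assuming a simple prompt-length mask (frameworks vary in how they express this):

```python
import numpy as np

def sft_token_loss(logits, labels, prompt_len):
    """Mean cross-entropy over response tokens only. Labels for the
    first `prompt_len` positions (the instruction) are masked out, so
    the model is trained to produce the response given the prompt.
    logits: (seq, vocab); labels: (seq,) token ids."""
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(labels)), labels]
    mask = np.arange(len(labels)) >= prompt_len  # response positions only
    return nll[mask].mean()
```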
GQA: Grouped Query Attention—an interpolation between multi-head and multi-query attention; notably NOT used in K2, which uses standard attention.
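Although K2 does not use it, the interpolation is easy to see in a sketch: each group of query heads shares one key/value head. This is a naive reference implementation for illustration only.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_heads // n_kv_heads query heads shares one KV head.
    n_kv_heads == n_heads recovers multi-head attention (as in K2);
    n_kv_heads == 1 recovers multi-query attention."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # which shared KV head this query head reads
        scores = q[h] @ k[kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[kv]
    return out
```

The practical payoff of GQA is a smaller KV cache at inference time; K2's choice of standard attention trades that saving for the full per-head KV capacity.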