Memory Wall: The growing gap between how fast processors can compute (peak FLOPS) and how fast memory can supply data (bandwidth), which leaves processors idle while they wait for data.
Arithmetic Intensity: The number of floating-point operations (FLOPs) performed per byte of data loaded from memory; the higher the intensity, the more compute-bound and the less memory-bound the workload.
FLOPS: Floating Point Operations Per Second—a rate measure of hardware peak performance.
FLOPs: Floating Point Operations—a count of the total mathematical operations required for a specific task.
MOPs: Memory Operations—the total number of bytes accessed/transferred during a computation.
Encoder model: A Transformer architecture (e.g., BERT) that processes all input tokens simultaneously, enabling high-intensity matrix-matrix operations.
Decoder model: A Transformer architecture (e.g., GPT) that generates tokens one by one (auto-regressively), often relying on lower-intensity matrix-vector operations during inference.
Auto-regressive: A generation process where the model predicts the next token based on previous tokens, appending it to the sequence and repeating the process.
Hyperscaler: A large-scale cloud service provider (e.g., Google, Amazon, Microsoft) capable of massive distributed computing.
Rematerialization: A technique to reduce memory footprint by recomputing intermediate activations during the backward pass instead of storing them, trading compute for memory.
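
To make the Arithmetic Intensity, FLOPs, and MOPs entries concrete, the sketch below estimates FLOPs per byte for one matrix multiplication in an encoder-style case (many tokens at once) and a decoder-style case (one token per step). It is an illustration under stated assumptions, not a measurement: the function name matmul_intensity, the 4096 x 4096 weight size, the 512-token batch, and 2-byte (fp16) elements are all hypothetical, and each operand is assumed to move between memory and the processor exactly once, with the output write counted as traffic.

```python
def matmul_intensity(m, k, n, bytes_per_element=2):
    """Estimate FLOPs per byte for C = A @ B, with A of shape (m, k) and B of shape (k, n).

    Assumes ideal reuse: A and B are each read once and C is written once.
    """
    # Each of the m * n outputs needs k multiplies and k adds.
    flops = 2 * m * k * n
    # Bytes moved (MOPs): read A and B, write C.
    mops = bytes_per_element * (m * k + k * n + m * n)
    return flops / mops


# Encoder-style: 512 tokens processed together (matrix-matrix product).
encoder_intensity = matmul_intensity(512, 4096, 4096)

# Decoder-style: one token per generation step (matrix-vector product).
decoder_intensity = matmul_intensity(1, 4096, 4096)

print(f"encoder-style: {encoder_intensity:.1f} FLOPs/byte")
print(f"decoder-style: {decoder_intensity:.2f} FLOPs/byte")
```

Under these assumptions the matrix-matrix case performs hundreds of FLOPs per byte, while the matrix-vector case performs roughly one, because the full weight matrix must still be read to produce a single output vector. That difference is why decoder inference tends to be limited by memory bandwidth rather than by peak FLOPS.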