MHSA: Multi-Head Self-Attention, the core component of Transformers, which relates every token to every other token
FFN: Feed-Forward Network, a layer applied to each token independently, usually two linear transformations with a nonlinear activation between them
Memory-bound: Operations where execution speed is limited by how fast data can be moved between memory and the processor, rather than calculation speed
Sandwich Layout: A proposed block structure where one attention layer is placed between multiple FFN layers to minimize memory-heavy attention operations
CGA: Cascaded Group Attention, an attention mechanism in which each head receives a different split of the input features and each head's output is added to the next head's input
ONNX: Open Neural Network Exchange, an open format for representing machine learning models, often used for deployment
Taylor structured pruning: A method to estimate the importance of network channels using gradient-weight products to guide parameter allocation
FLOPs: Floating Point Operations, a theoretical measure of compute cost that is often only loosely correlated with actual latency
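
The Memory-bound and FLOPs entries can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the tensor shapes, and the 4-byte element size are assumptions chosen for the example, and the byte counts assume each operand is read or written exactly once. It compares FLOPs per byte moved (arithmetic intensity) for a matrix multiply and for an elementwise add of similar size.

```python
def matmul_cost(m, k, n, bytes_per_element=4):
    """Return (flops, bytes_moved) for an (m x k) @ (k x n) matrix multiply.

    Assumes each input is read once and the output is written once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops, bytes_moved


def elementwise_add_cost(num_elements, bytes_per_element=4):
    """Return (flops, bytes_moved) for adding two tensors elementwise."""
    flops = num_elements  # one add per output element
    bytes_moved = 3 * num_elements * bytes_per_element  # read two inputs, write one output
    return flops, bytes_moved


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical shapes: 196 tokens, 192 channels.
    mm = matmul_cost(196, 192, 192)
    add = elementwise_add_cost(196 * 192)
    print("matmul intensity:", arithmetic_intensity(*mm))
    print("add intensity:   ", arithmetic_intensity(*add))
```

The add performs far fewer FLOPs than the matmul but moves about the same number of bytes, so on hardware where memory traffic dominates its runtime is set by bandwidth rather than by arithmetic. That gap is one reason a FLOPs count alone can be a poor predictor of latency.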