WHT: Walsh-Hadamard Transform—a non-sinusoidal orthogonal transform that decomposes signals into rectangular waves (Walsh functions) using only additions and subtractions
MHA: Multi-Head Attention—the core mechanism in Transformers allowing the model to attend to different parts of the sequence simultaneously
Butterfly factorization: A recursive algorithmic structure (like in FFT) that reduces computational complexity from $O(n^2)$ to $O(n \log n)$
RoPE: Rotary Positional Embeddings—a method for encoding position information by rotating the query and key vectors in the embedding space
SwiGLU: Swish-Gated Linear Unit—a gated feed-forward variant combining the Swish (SiLU) activation with a multiplicative gate, used in the FFN blocks of modern LLMs (like Llama) for better performance
FLOPs: Floating Point Operations—a measure of computational cost
Bfloat16: Brain Floating Point Format—a 16-bit floating point format with the same dynamic range as 32-bit float, commonly used in ML training
HBM: High Bandwidth Memory—the fast memory on GPUs where model weights and KV caches are stored
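To make the WHT and butterfly-factorization entries concrete, here is a minimal sketch of the fast Walsh-Hadamard transform in NumPy. The butterfly loop structure is what gives the $O(n \log n)$ cost, and the inner update uses only additions and subtractions, as the glossary notes. The function name `fwht` is our own choice for illustration.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (natural/Hadamard order).

    A butterfly factorization: log2(n) stages, each touching every
    element once with only an addition and a subtraction, for a total
    of O(n log n) operations instead of the O(n^2) matrix product.
    """
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # One butterfly stage: combine elements h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x
```

A useful sanity check is that the unnormalized WHT is its own inverse up to a factor of $n$: applying `fwht` twice returns the input scaled by its length.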
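The RoPE entry can likewise be sketched in a few lines. This is a simplified NumPy version under the usual assumptions (even head dimension, consecutive coordinate pairs rotated at per-pair frequencies derived from a base of 10000); it is meant to illustrate the "rotating query and key vectors" idea, not to match any particular library's layout.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) of a vector x by the
    angle pos * base**(-2i/d), encoding position pos into the vector.

    Because rotation is norm-preserving and composes additively in the
    angle, the dot product of a rotated query and key depends only on
    their relative position.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "dimension must be even"
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The relative-position property can be checked directly: rotating a query to position 5 and a key to position 7 gives the same dot product as positions 0 and 2.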