LM head: The final linear layer of a language model that projects hidden states to the vocabulary size, followed by a softmax
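A minimal numpy sketch of the two steps this entry names, projection then softmax; the sizes (`d_model = 16`, `vocab_size = 100`) are illustrative assumptions, not values from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, batch = 16, 100, 4  # illustrative sizes

hidden_states = rng.standard_normal((batch, d_model))
W_head = rng.standard_normal((vocab_size, d_model))  # LM head weight matrix

logits = hidden_states @ W_head.T  # project to vocab size: (batch, vocab_size)

# Softmax turns the raw scores into a probability distribution over tokens
# (subtracting the row max first for numerical stability).
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

assert probs.shape == (batch, vocab_size)
assert np.allclose(probs.sum(axis=-1), 1.0)  # each row is a distribution
```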
softmax bottleneck: The phenomenon where a model cannot represent high-rank probability distributions because the hidden dimension is much smaller than the vocabulary size
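The rank limit behind this entry can be checked directly: since every logit matrix factors through the hidden dimension, its rank can never exceed `d_model`. A small numpy check (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_contexts = 16, 100, 500  # illustrative sizes

H = rng.standard_normal((n_contexts, d_model))  # hidden states, one per context
W = rng.standard_normal((vocab_size, d_model))  # LM head weight

logits = H @ W.T  # (n_contexts, vocab_size)

# The logit matrix factors as H @ W.T, so its rank is capped at d_model,
# far below vocab_size -- this is the bottleneck.
rank = np.linalg.matrix_rank(logits)
print(rank)  # 16
```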
gradient bottleneck: The proposed theory that the low-rank LM head compresses and destroys gradient information during the backward pass
logits: The raw, unnormalized scores output by the final linear projection before the softmax function is applied
SVD: Singular Value Decomposition—a mathematical method to factorize a matrix, used here to analyze the rank and information content of gradients
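A short sketch of how SVD exposes a matrix's rank, as used for the gradient analysis described here (the matrix is a random placeholder, not an actual gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # placeholder for a gradient matrix

# Factorize: A = U @ diag(S) @ Vt, with S non-negative and sorted descending.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# The factorization reconstructs A exactly...
assert np.allclose(U @ np.diag(S) @ Vt, A)

# ...and the number of nonzero singular values equals the rank.
assert np.count_nonzero(S > 1e-10) == np.linalg.matrix_rank(A)
```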
rank: The dimension of the vector space generated by the columns (or rows) of a matrix; a low-rank matrix has limited information capacity
Eckart-Young-Mirsky theorem: A theorem stating that the best low-rank approximation of a matrix (in Frobenius or spectral norm) is obtained by truncating its SVD to the largest singular values; used here to quantify gradient information loss
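The theorem can be verified numerically on a small random matrix: the best rank-k approximation keeps the k largest singular values, and the Frobenius-norm error equals the energy in the discarded ones (the matrix and `k = 2` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))  # placeholder matrix
k = 2                            # target rank

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: truncate the SVD to the k largest singular values.
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

# Eckart-Young-Mirsky: the Frobenius error of the truncation equals the
# root-sum-square of the discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt((S[k:] ** 2).sum()))
```

This is also why summing the squared singular values beyond rank k gives a direct measure of how much information a low-rank projection throws away.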