ALiBi: Attention with Linear Biases—a positional encoding scheme that penalizes each attention score in proportion to the query–key distance, using a static, head-specific slope
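A minimal sketch of the ALiBi bias (function names here are illustrative, not from any library). For a power-of-two head count, the slopes form a geometric sequence starting at 2^(-8/n), and the bias added to each score is minus the slope times the query–key distance:

```python
import math

def alibi_slopes(n_heads):
    # Geometric sequence of head-specific slopes: 2^(-8/n), 2^(-16/n), ...
    # (the scheme ALiBi uses for power-of-two head counts)
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

def alibi_bias(slope, seq_len):
    # Static bias added to causal attention scores: -slope * (query - key),
    # so keys farther back are penalized linearly.
    return [[-slope * (q - k) for k in range(q + 1)] for q in range(seq_len)]
```

Because the bias depends only on relative distance, no learned positional embeddings are needed.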
BOS sink: An attention head that directs the vast majority of its attention mass to the Beginning-of-Sequence (BOS) token, often rendering the head functionally useless for context processing
BOS mass: The fraction of a head's total attention weight assigned to the token at position 0 (the BOS token)
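BOS mass reduces to reading off column 0 of a head's attention rows and averaging over queries. A minimal sketch with hypothetical helper names:

```python
def head_bos_mass(attn):
    # attn: one head's attention rows, one per query position;
    # each row is a distribution over key positions and sums to 1.
    # BOS mass = average weight placed on key position 0.
    return sum(row[0] for row in attn) / len(attn)
```

A head whose BOS mass approaches 1.0 is behaving as a BOS sink.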
Gradient masking: A technique where gradients for specific parameters are zeroed out during backpropagation, effectively freezing those weights while allowing others to train
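A minimal, framework-agnostic sketch of gradient masking (the SGD step and mask layout here are illustrative assumptions): the gradient is multiplied elementwise by a 0/1 mask before the update, so masked parameters receive no update while the rest train normally.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    # mask[i] == 0 freezes params[i] by zeroing its gradient;
    # mask[i] == 1 lets it update as usual.
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]
```

In a framework like PyTorch the same effect is typically achieved by zeroing `.grad` tensors (or via hooks) between backward and optimizer steps.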
Entropy: Shannon entropy measured on the attention distribution; low entropy indicates the head is focusing on very few tokens (often just the BOS)
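The entropy of an attention row can be computed directly from its definition; this sketch (in nats) assumes the row is a valid probability distribution:

```python
import math

def attention_entropy(attn_row):
    # Shannon entropy of one query's attention distribution:
    # 0 when all mass sits on a single token (e.g. a BOS sink),
    # log(len(attn_row)) when attention is uniform.
    return -sum(p * math.log(p) for p in attn_row if p > 0)
```

Near-zero entropy across many queries is thus a practical signal that a head has collapsed onto one token.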
Xavier normal initialization: A method of initializing neural network weights with random values drawn from a normal distribution scaled by the layer size, used here to reset collapsed heads
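A stdlib-only sketch of Xavier (Glorot) normal initialization; the function name is illustrative. The standard deviation is sqrt(2 / (fan_in + fan_out)), which keeps activation variance roughly constant across layers:

```python
import math
import random

def xavier_normal(fan_in, fan_out, seed=0):
    # Draw weights from N(0, std^2) with std = sqrt(2 / (fan_in + fan_out)).
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
```

Resetting a collapsed head means overwriting its projection weights with fresh draws like these.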
PPL: Perplexity—a metric measuring how well a probability model predicts a sample; lower values indicate better prediction
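Perplexity follows directly from per-token log-probabilities: it is the exponential of the mean negative log-likelihood. A minimal sketch:

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probability the model assigned to each
    # observed token. PPL = exp(mean negative log-likelihood).
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

Intuitively, a PPL of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.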