Cesàro matrix: A lower-triangular matrix whose row i has the value 1/i in each of its first i columns and zeros elsewhere, so applying it computes a cumulative average; models uniform causal attention at initialization
Primacy bias: The tendency of models to attend heavily to the first few tokens; mathematically proven here to be a logarithmic divergence caused by causal masking
Recency bias: The tendency of models to attend to the most recent tokens; proven here to be an isolated delta spike caused by residual connections
RoPE: Rotary Position Embeddings, a method of encoding position by rotating query/key vectors; shown here to be irrelevant to the initialization topology due to rotational symmetry
Jacobian: A matrix of all first-order partial derivatives of a vector-valued function; its norm measures how much the output changes given a change in input
Isotropic Gaussian: A distribution where variance is the same in all directions; used to model random weight initialization
SwiGLU: A gated linear unit variant that uses the Swish (SiLU) activation as its gate; used in the feed-forward layers of modern LLMs like Qwen2 and Llama
RMSNorm: Root Mean Square Normalization, a technique that rescales activations by their root mean square without subtracting the mean; used in transformers to stabilize training
Spearman correlation: A statistical measure of rank correlation (monotonic relationship) between two variables
Wasserstein distance: A distance measure between probability distributions, also known as Earth Mover's Distance
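
As an illustration of the Cesàro matrix entry above, the following minimal sketch builds the matrix directly and compares it with the attention weights obtained from a causally masked softmax over all-zero logits. The all-zero logits are an assumption standing in for "uniform attention at initialization", and the function names (cesaro_matrix, uniform_causal_attention, apply_rows) are hypothetical helpers introduced only for this example, not part of any library.

```python
import math

def cesaro_matrix(n):
    """Build the n x n Cesaro matrix: row i (1-indexed) holds 1/i in its first i columns."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

def uniform_causal_attention(n):
    """Attention weights from a softmax over all-zero logits under a causal mask.

    Assumption: zero logits stand in for uniform attention at initialization.
    """
    rows = []
    for i in range(n):
        # Causal mask: position i may only attend to positions 0..i.
        logits = [0.0 if j <= i else -math.inf for j in range(n)]
        exps = [math.exp(x) for x in logits]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def apply_rows(matrix, values):
    """Multiply a matrix (list of rows) by a vector of scalar values."""
    return [sum(w * v for w, v in zip(row, values)) for row in matrix]

C = cesaro_matrix(4)
A = uniform_causal_attention(4)

# Each output is the running mean of the values seen so far:
# approximately 2, 3, 4, 5 for the input below.
averages = apply_rows(C, [2.0, 4.0, 6.0, 8.0])
print(averages)
```

Under these assumptions the two matrices agree entry by entry, which is the sense in which the Cesàro matrix can stand in for uniform causal attention: every position simply averages everything at or before it.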