ProRes: Progressive Residual Warmup—the proposed method, which scales each residual branch's contribution from 0 to 1 over the course of training, with deeper layers reaching their full contribution later
Pre-LN: Pre-Layer Normalization—applying normalization before the sub-layer (Attention/MLP) inside the residual block
Post-LN: Post-Layer Normalization—applying normalization after the residual connection
Sandwich-LN: An architecture that adds normalization both before and after each sub-layer to bound activation values, improving stability but sometimes limiting expressivity
DeepNorm: An initialization and normalization scaling method designed to stabilize extremely deep Transformers
SwiGLU: A gated activation function combining Swish and GLU, commonly used in Llama architectures
RoPE: Rotary Position Embedding—a relative position encoding method that rotates query and key vectors
Perplexity: A measure of how well a probability model predicts a sample, computed as the exponential of the average negative log-likelihood; lower is better
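Since ProRes is the proposed method, a minimal sketch of the warmup schedule may help. The function name `prores_alpha`, the linear ramp, and the per-layer stagger below are illustrative assumptions, not the paper's actual schedule:

```python
def prores_alpha(step, layer_idx, num_layers, warmup_steps):
    """Warmup coefficient in [0, 1] for one residual branch.

    Assumption: each layer's ramp is linear over `warmup_steps` steps,
    and deeper layers start (and therefore finish) their ramp later.
    """
    start = warmup_steps * layer_idx / num_layers
    return min(max((step - start) / warmup_steps, 0.0), 1.0)

def prores_residual(x, sublayer, step, layer_idx, num_layers, warmup_steps):
    # Residual update y = x + alpha * f(x), with alpha warming up over time.
    alpha = prores_alpha(step, layer_idx, num_layers, warmup_steps)
    return x + alpha * sublayer(x)
```

At step 0 every branch contributes nothing (the block is the identity), and by the end of its ramp each branch contributes at full strength.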
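The SwiGLU entry can be made concrete with a scalar sketch of the gating form SwiGLU(a, b) = Swish(a) · b, where a and b stand for two linear projections of the same input; the function names here are illustrative:

```python
import math

def swish(z):
    # Swish (also called SiLU): z * sigmoid(z).
    return z / (1.0 + math.exp(-z))

def swiglu(a, b):
    # One projection is passed through Swish and gates the other.
    return swish(a) * b
```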
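RoPE's rotation can be sketched on a single vector: consecutive (even, odd) coordinate pairs are rotated by position-dependent angles, so that the dot product of a rotated query and key depends only on their relative offset. The function name and the standard base of 10000 are assumptions for illustration:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    # Rotate each (vec[i], vec[i+1]) pair by angle pos * base**(-i/d).
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

Because the rotation is linear in position, dot(rope_rotate(q, m), rope_rotate(k, n)) depends only on m − n, which is what makes the encoding relative.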
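The perplexity definition translates directly into code: it is the exponential of the average negative log-likelihood of the probabilities the model assigned to the observed tokens.

```python
import math

def perplexity(probs):
    # probs: the model's probability for each observed token.
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)
```

For example, a model that assigns probability 1/4 to every token has perplexity 4, as if it were choosing uniformly among four options.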