Recurrent depth: Applying the same transformer block multiple times in a loop to process the same tokens, effectively increasing network depth without adding parameters
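A minimal sketch of the idea, with a stand-in `block` function in place of a real transformer block (the names here are illustrative, not from the model's actual code):

```python
def recurrent_forward(block, x, num_iterations):
    """Apply the same block repeatedly: the weights are reused on every
    pass, so depth grows with num_iterations but parameter count does not."""
    for _ in range(num_iterations):
        x = block(x)
    return x

# Toy example: a "block" that just increments its input.
print(recurrent_forward(lambda v: v + 1, 0, 5))  # -> 5
```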
Latent space: The internal vector representation of data within a neural network, as opposed to the discrete token space of words
Test-time compute: The amount of computation (FLOPs) used during inference (generating answers), which can be increased to improve performance
Chain-of-Thought: A technique where models generate intermediate reasoning steps in text before producing the final answer
RoPE: Rotary Positional Embeddings—a method for encoding token positions in transformers using rotation matrices
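RoPE's key property is that rotating queries and keys by position-dependent angles makes their dot product depend only on relative position. A toy sketch on a single 2-D feature pair (a simplification: real RoPE rotates every consecutive pair of dimensions, each at its own frequency):

```python
import math

def rope_rotate(pair, position, freq=1.0):
    """Rotate one 2-D feature pair by an angle proportional to its position."""
    angle = position * freq
    x, y = pair
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

# Attention scores depend only on relative position: shifting both
# the query and key positions by the same offset leaves the dot
# product unchanged.
q, k = (1.0, 0.5), (0.2, 0.8)
dot = lambda a, b: a[0] * b[0] + a[1] * b[1]
s1 = dot(rope_rotate(q, 3), rope_rotate(k, 1))
s2 = dot(rope_rotate(q, 13), rope_rotate(k, 11))
print(abs(s1 - s2) < 1e-9)  # -> True
```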
RMSNorm: Root Mean Square Normalization—a normalization technique used to stabilize training in deep neural networks
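For reference, RMSNorm divides each vector by its root mean square and applies a learned per-dimension gain, with no mean subtraction or bias (a plain-Python sketch; `eps` is the usual small constant for numerical stability):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then apply
    a learned elementwise gain (no mean-centering, unlike LayerNorm)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

print(rms_norm([3.0, 4.0], [1.0, 1.0]))  # roughly [0.849, 1.131]
```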
KV-cache: Key-Value cache—storing previously computed attention keys and values so they need not be recomputed during generation, which this model can share across recurrent steps

SiLU: Sigmoid Linear Unit—an activation function used in the model's MLPs
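SiLU is simply the input multiplied by its sigmoid, a smooth alternative to ReLU:

```python
import math

def silu(x):
    """SiLU (a.k.a. swish) activation: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))

print(silu(0.0))  # -> 0.0
```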
Effective depth: The total number of layers the data passes through, calculated as (prelude layers) + (recurrent layers × iterations) + (coda layers)
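The arithmetic is straightforward; the configuration numbers below are illustrative, not necessarily the model's actual layer counts:

```python
def effective_depth(prelude, recurrent, coda, iterations):
    """Total layers traversed: prelude + (recurrent layers * iterations) + coda."""
    return prelude + recurrent * iterations + coda

# e.g. 2 prelude layers + 4 recurrent layers looped 32 times + 2 coda layers
print(effective_depth(2, 4, 2, 32))  # -> 132
```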