Looped Transformer: A transformer that applies the same layer weights to the hidden state repeatedly, rather than stacking distinct layers
Adaptive Looping: A mechanism where the model learns a probability distribution for halting at each step, rather than looping a fixed number of times
Iso-FLOP: A baseline model scaled to match the floating-point operations (compute cost) of the proposed model, typically by having more layers
Iso-Parameter: A baseline model scaled to match the total parameter count of the proposed model, typically by increasing width
BPB: Bits-per-byte, a byte-normalized version of negative log-likelihood used to evaluate language modeling performance; lower is better
FineWeb-Edu: A large-scale dataset of educational web content used for pre-training language models
QK-normalization: Applying layer normalization to the queries and keys before the attention dot product to stabilize training
Halting Router: A small MLP that predicts the probability of stopping the loop iteration at the current step
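
Two of the definitions above (BPB and the per-step halting probability) are easier to see with numbers. The following is a minimal sketch under stated assumptions, not the source's implementation: the function names `bits_per_byte` and `halting_distribution` are hypothetical, the loss is assumed to be a summed negative log-likelihood in nats, and the way per-step stop probabilities are combined into a distribution over loop counts follows a common stick-breaking convention that the glossary itself does not specify.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count
    makes the score independent of the tokenizer.
    """
    return total_nll_nats / (math.log(2) * num_bytes)


def halting_distribution(stop_probs: list[float]) -> list[float]:
    """Turn per-step stop probabilities into a distribution over loop counts.

    Assumed convention (not taken from the source):
        p(halt at step t) = stop_probs[t] * prod_{s < t} (1 - stop_probs[s])
    The last allowed step absorbs all remaining mass, so the result sums to 1.
    """
    dist = []
    remaining = 1.0  # probability that the loop is still running
    for p in stop_probs[:-1]:
        dist.append(remaining * p)
        remaining *= 1.0 - p
    dist.append(remaining)  # forced halt at the final step
    return dist


if __name__ == "__main__":
    # A halting router that says "stop with probability 0.5" at every step,
    # with at most three loop iterations allowed.
    print(halting_distribution([0.5, 0.5, 0.5]))
    # A total loss of 8 * ln(2) nats over 4 bytes of text.
    print(bits_per_byte(8 * math.log(2), 4))
```

In a real model the stop probabilities would come from the halting router evaluated on the hidden state at each iteration; here they are plain numbers so that the arithmetic is visible.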