Looped Model ($k \otimes L$): A model architecture in which a block of $k$ unique transformer layers is applied iteratively $L$ times, with the output of each iteration fed back as the input to the next
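The looped architecture can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name, hyperparameters, and the use of PyTorch's encoder layer as the looped block are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class LoopedModel(nn.Module):
    """Sketch of a k-otimes-L looped model: k unique layers reused L times.

    Illustrative only; the paper's actual block structure may differ.
    """

    def __init__(self, d_model: int, n_heads: int, k: int, L: int):
        super().__init__()
        # k unique transformer layers form the looped block
        self.block = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(k)
        )
        self.L = L

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same k layers are applied L times, so the effective depth is
        # k * L while only k layers' worth of parameters are trained.
        for _ in range(self.L):
            for layer in self.block:
                x = layer(x)
        return x

model = LoopedModel(d_model=32, n_heads=4, k=2, L=3)
h = model(torch.randn(2, 5, 32))  # output keeps the input shape (2, 5, 32)
```

Note that increasing $L$ deepens the computation without adding any parameters, which is exactly what makes the iso-param and iso-flop comparisons below meaningful.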
Iso-flop: Comparison between models that require the same number of floating-point operations (computation) during inference (e.g., a $k \otimes L$ looped model vs. a $kL$-layer non-looped model)
Iso-param: Comparison between models that have the same number of trainable parameters (e.g., a $k$-layer non-looped model vs. a $k \otimes L$ looped model, since looping reuses the same weights)
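The two comparison regimes reduce to simple bookkeeping: parameters scale with the number of *unique* layers, while inference FLOPs scale with the number of layer *applications*. A back-of-envelope sketch, with purely hypothetical per-layer costs:

```python
def layer_cost(params_per_layer: int, flops_per_layer: int,
               depth: int, loops: int = 1) -> dict:
    """Parameters scale with unique depth; FLOPs scale with depth * loops."""
    return {
        "params": params_per_layer * depth,
        "flops": flops_per_layer * depth * loops,
    }

# Hypothetical per-layer costs, for illustration only.
P, F = 10_000_000, 20_000_000

looped = layer_cost(P, F, depth=4, loops=6)   # a 4-layer block looped 6 times
deep = layer_cost(P, F, depth=24)             # a 24-layer non-looped model
shallow = layer_cost(P, F, depth=4)           # a 4-layer non-looped model

assert looped["flops"] == deep["flops"]       # iso-flop pair
assert looped["params"] == shallow["params"]  # iso-param pair
```

The same $4 \otimes 6$ looped model is thus iso-flop with the 24-layer model and iso-param with the 4-layer model, which is why both baselines appear in such comparisons.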
Inductive Bias: Assumptions built into a learning algorithm that encourage it to learn certain types of solutions (here, reasoning processes) over others (like rote memorization)
Latent Thoughts: Intermediate hidden states generated across the loop iterations that act as reasoning steps, analogous to the explicit intermediate tokens produced in Chain-of-Thought prompting
Perplexity: A measure of how well a probability model predicts a sample; lower values indicate that the model assigns higher likelihood to the data
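Concretely, perplexity is the exponential of the mean negative log-likelihood of the target tokens. A minimal computation:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood of the target tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every target token has
# perplexity 4: on average it is as uncertain as a uniform 4-way choice.
assert abs(perplexity([0.25, 0.25, 0.25, 0.25]) - 4.0) < 1e-9
```

A perfect model (probability 1 on every target) reaches the minimum perplexity of 1.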
i-GSM: A synthetic grade-school math dataset constructed as a Directed Acyclic Graph (DAG) of arithmetic operations
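To make the DAG structure concrete, here is a toy generator in the same spirit. The actual i-GSM construction differs in its details (operation set, sampling distribution, natural-language surface form); this sketch only shows how restricting each node's parents to earlier nodes yields a DAG of arithmetic operations.

```python
import random

def build_igsm_like_dag(n_nodes: int, seed: int = 0):
    """Toy sketch of an i-GSM-style problem: each node is either a leaf
    constant or an arithmetic operation over earlier nodes, so all edges
    point forward and the dependency graph is acyclic.
    """
    random.seed(seed)
    values, exprs = [], []
    for i in range(n_nodes):
        if i < 2 or random.random() < 0.3:
            # Leaf node: a small constant
            v = random.randint(1, 9)
            exprs.append(f"x{i} = {v}")
        else:
            # Internal node: operation over two strictly earlier nodes
            a, b = random.sample(range(i), 2)
            op = random.choice(["+", "*"])
            v = values[a] + values[b] if op == "+" else values[a] * values[b]
            exprs.append(f"x{i} = x{a} {op} x{b}")
        values.append(v)
    return exprs, values

exprs, values = build_igsm_like_dag(5)  # 5 assignments and their values
```

Solving such a problem amounts to evaluating the DAG in topological order, which is what makes the dataset a controlled probe of multi-step reasoning.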