CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps in text
BPTT: Backpropagation Through Time—the standard algorithm for training RNNs by unrolling the network over time; it is memory-intensive because every intermediate hidden state must be stored for the backward pass
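The memory cost is concrete even for a scalar RNN: BPTT keeps every hidden state from the forward pass. A toy sketch (the update rule and the loss, the final hidden state, are illustrative, not from any particular model):

```python
import numpy as np

def bptt_grad(w, xs):
    """Gradient of L = h_T w.r.t. w for the scalar RNN h_t = tanh(w*h_{t-1} + x_t).
    The forward pass stores every hidden state -- this list is exactly
    the O(T) memory cost that equilibrium-based gradients avoid."""
    hs = [0.0]
    for x in xs:                        # forward: unroll over time
        hs.append(float(np.tanh(w * hs[-1] + x)))
    grad, dh = 0.0, 1.0                 # backward: walk time in reverse
    for t in range(len(xs), 0, -1):
        ds = dh * (1.0 - hs[t] ** 2)    # backprop through tanh
        grad += ds * hs[t - 1]          # contribution of w at step t
        dh = ds * w                     # pass gradient back to h_{t-1}
    return grad
```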
Deep Equilibrium Models (DEQ): Neural networks that find a fixed point (equilibrium) of a hidden layer and compute gradients using the Implicit Function Theorem instead of unrolling layers
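A minimal sketch of the DEQ forward pass for a toy contractive map z = tanh(Wz + x + b); real DEQs use faster root solvers (e.g. Broyden's method) and Implicit-Function-Theorem gradients rather than plain iteration:

```python
import numpy as np

def deq_forward(x, W, b, tol=1e-6, max_iter=100):
    """Iterate z <- tanh(W z + x + b) until an approximate fixed point.
    Only the equilibrium z* is needed for the implicit gradient, so no
    per-iteration history has to be kept (unlike BPTT)."""
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_new = np.tanh(W @ z + x + b)
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```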
ARC-AGI: Abstraction and Reasoning Corpus—a benchmark measuring general intelligence through few-shot solving of visual logic puzzles
Hierarchical convergence: The process where a low-level module converges to a local equilibrium conditioned on a high-level state, which then updates to restart the low-level process
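The two-loop structure can be sketched with scalar states; the update rules and coefficients below are illustrative stand-ins, not the model's actual modules:

```python
import numpy as np

def hierarchical_step(x, n_cycles=3, n_low=20):
    """Toy hierarchical convergence: the low-level state zL iterates to
    an equilibrium conditioned on the frozen high-level state zH; zH then
    takes one update from the converged zL, restarting the low-level
    process with a new context."""
    zH, zL = 0.0, 0.0
    for _ in range(n_cycles):
        for _ in range(n_low):           # low-level converges given zH
            zL = np.tanh(0.4 * zL + zH + x)
        zH = np.tanh(0.5 * zH + zL)      # high-level update, new context
    return zH, zL
```

Because the inner map is a contraction, extra low-level iterations change nothing once zL has converged, which is what makes the nested structure stable.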
One-step gradient: An approximation method that computes gradients at the equilibrium point using only the final state, avoiding the memory cost of storing history
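For a scalar map z = tanh(wz + x) the approximation can be written out and compared with the exact Implicit-Function-Theorem gradient; these functions are illustrative, not any library's API:

```python
import numpy as np

def fixed_point(w, x, iters=200):
    z = 0.0
    for _ in range(iters):
        z = np.tanh(w * z + x)
    return z

def one_step_grad(w, x):
    """Gradient of ONE application z' = tanh(w*z + x) w.r.t. x, evaluated
    at the equilibrium z* with z* treated as a constant -- no iteration
    history is stored."""
    z_star = fixed_point(w, x)
    return 1.0 - np.tanh(w * z_star + x) ** 2

def exact_implicit_grad(w, x):
    """Exact dz*/dx from the Implicit Function Theorem, for comparison:
    dz*/dx = s / (1 - w*s) with s = sech^2(w*z* + x)."""
    z_star = fixed_point(w, x)
    s = 1.0 - np.tanh(w * z_star + x) ** 2
    return s / (1.0 - w * s)
```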
Deep supervision: Training technique where the model predicts the output and computes loss at multiple intermediate steps (segments) rather than just at the end
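A sketch of the training-loop shape, using a toy one-step "segment" update; in an autograd framework the state would also be detached between segments so each loss only trains its own segment:

```python
import numpy as np

def segment_step(state, x, w):
    # Hypothetical recurrent update standing in for one segment
    # of the model's internal iterations.
    return np.tanh(w * state + x)

def deep_supervision_losses(x, target, w, n_segments=4):
    """Run n_segments forward segments, computing a loss after each one
    instead of only at the end.
    In a real training loop: state = state.detach() between segments."""
    state = np.zeros_like(x)
    losses = []
    for _ in range(n_segments):
        state = segment_step(state, x, w)
        losses.append(float(np.mean((state - target) ** 2)))
    return losses
```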
Adaptive Computational Time (ACT): Mechanism allowing the model to dynamically decide when to stop 'thinking' (iterating internal states) based on a learned halting policy
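One classic formulation of such halting is Graves-style ACT, where a learned head emits a halting probability each step and iteration stops once the cumulative probability crosses a threshold; `step_fn` and `halt_fn` below are hypothetical stand-ins for the model's state update and halting head:

```python
def act_loop(step_fn, state, halt_fn, max_steps=16, threshold=0.99):
    """Iterate until the cumulative halting probability crosses the
    threshold, or until max_steps is hit. Returns the final state and
    the number of steps actually taken."""
    cum = 0.0
    for t in range(max_steps):
        state = step_fn(state)
        cum += halt_fn(state)       # learned halting probability
        if cum >= threshold:
            break
    return state, t + 1
```

The payoff is variable compute: easy inputs halt early, hard inputs keep iterating up to the cap.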
RMSNorm: Root Mean Square Normalization—a normalization technique used in Transformers to stabilize training
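The operation itself is small enough to write out; gamma is the learned gain, and unlike LayerNorm there is no mean-centering or bias term:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    """RMSNorm over the last axis: divide x by its root mean square,
    then apply the learned elementwise gain gamma."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```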
AdamW: A variant of the Adam optimizer that decouples weight decay from gradient updates
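A single-step sketch of the update; note the weight-decay term is added directly to the weight update rather than folded into the gradient (which is what plain Adam with L2 regularization would do):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for parameters w with gradient g.
    m, v are the running first/second moments; t is the step count."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: wd * w is applied to the weights directly,
    # outside the adaptive gradient term -- the key difference from Adam.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```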
Post-Norm: An architecture where normalization is applied after the residual connection (as in the original Transformer), in contrast to Pre-Norm, which normalizes the input to each sublayer
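The difference from Pre-Norm is just the order of operations around the residual connection; `rms_norm` here is a minimal stand-in with the learned gain omitted:

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # minimal normalization for illustration (gain omitted)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Post-Norm: residual add first, THEN normalize the sum
    return rms_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-Norm (for contrast): normalize first, then add the residual
    return x + sublayer(rms_norm(x))
```

One visible consequence: Post-Norm renormalizes the residual stream at every block, while Pre-Norm leaves an unnormalized residual path straight through the network.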