sLSTM: Scalar LSTM, an LSTM variant with scalar memory cells, exponential gating, and a normalizer state; it supports memory mixing through recurrent connections and therefore must be computed sequentially
mLSTM: Matrix LSTM, a variant using a matrix memory state updated via an outer product rule (covariance update), which is fully parallelizable because it has no hidden-to-hidden recurrent connections (no memory mixing)
exponential gating: Using exp() instead of a sigmoid activation for the input gate (and optionally the forget gate), allowing the model to revise or preserve memory states more aggressively
covariance update rule: An update mechanism where the memory matrix is modified by adding the outer product of a value vector and a key vector (C_t = C_{t-1} + v_t k_t^T)
memory mixing: The interaction between hidden states from different memory cells (or heads) via recurrent weight matrices, crucial for state tracking
xLSTM block: A residual block wrapping either an sLSTM (with post up-projection) or mLSTM (with pre up-projection) into a standard deep learning backbone
BAM: Bidirectional Associative Memory—a type of recurrent network that stores pairs of vectors (keys and values) using correlation matrices
SlimPajama: A large-scale, deduplicated dataset for training large language models, derived from the RedPajama dataset
pre up-projection: A block design where inputs are projected to a high dimension *before* the core mixing/memory operation (used in mLSTM blocks)
post up-projection: A block design where the core operation happens in lower dimension, followed by projection to high dimension and back (used in sLSTM blocks, similar to Transformer FFNs)
FlashAttention: An algorithm that speeds up attention computation and reduces memory usage by optimizing GPU memory reads/writes (referenced here as a point of comparison for parallel computation)
state tracking: The ability of a model to maintain and update the status of entities or variables over time, often required for formal language tasks such as parity or Dyck languages
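
Example (covariance update rule): the sketch below illustrates the ungated rule C_t = C_{t-1} + v_t k_t^T from the glossary entry, using plain Python lists so it has no dependencies. The helper names (outer, covariance_update, retrieve) and the toy keys and values are assumptions made for illustration, not code from the xLSTM paper. Reading the memory out as C q is the standard associative-memory retrieval and is shown here without the gates and normalizer that mLSTM adds on top.

```python
def outer(v, k):
    """Outer product v k^T, returned as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]


def covariance_update(C, v, k):
    """Return C + v k^T without modifying C (the ungated update)."""
    vk = outer(v, k)
    return [
        [c + u for c, u in zip(c_row, u_row)]
        for c_row, u_row in zip(C, vk)
    ]


def retrieve(C, q):
    """Read out the memory with a query: the matrix-vector product C q."""
    return [sum(c * qj for c, qj in zip(row, q)) for row in C]


# Start from an empty 2x2 memory matrix.
C = [[0.0, 0.0], [0.0, 0.0]]

# Hypothetical (key, value) pairs; the keys are orthonormal on purpose.
pairs = [
    ([1.0, 0.0], [2.0, 3.0]),
    ([0.0, 1.0], [5.0, 7.0]),
]

# Store each pair by adding the outer product of its value and key.
for k, v in pairs:
    C = covariance_update(C, v, k)

# Querying with a stored key recovers the value paired with it.
recalled = retrieve(C, [1.0, 0.0])
print(recalled)
```

Because the two keys are orthonormal, each query returns its stored value exactly. With correlated keys the retrieved vector would be a mixture of several stored values.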