xLSTM: Extended Long Short-Term Memory

📝 Paper Summary

RNN scaling
Efficient sequence modeling

xLSTM revitalizes the LSTM architecture by introducing exponential gating and a matrix memory structure, enabling parallelization and scaling to large language modeling tasks, where it is competitive with Transformers.

Core Problem

Original LSTMs suffer from an inability to revise storage decisions, the limited capacity of scalar memory cells (which hurts on rare tokens), and a lack of parallelizability due to sequential memory mixing.

Why it matters:

Standard LSTMs cannot compete with Transformers at scale due to sequential processing bottlenecks and memory compression issues
Transformers have quadratic complexity in context length, creating a need for linear-complexity alternatives that still maintain high performance
Current linear attention and SSM approaches often lack the state tracking capabilities provided by explicit memory mixing in RNNs

Concrete Example: In a Nearest Neighbor Search task, a standard LSTM struggles to revise a stored value when a more similar vector appears later in the sequence, which shows up as high MSE, whereas xLSTM can overwrite the value using exponential gating.
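
A minimal reference solution for a task of this shape makes the need for revision explicit. It assumes each sequence element pairs a vector with a scalar value and that the target is the value attached to the vector closest to a reference; the exact task format and the name `nearest_neighbor_value` are illustrative assumptions, not taken from the paper.

```python
def nearest_neighbor_value(reference, pairs):
    """Scan (vector, value) pairs once and return the value of the closest vector."""
    best_dist, best_value = float("inf"), None
    for vec, value in pairs:
        # Squared Euclidean distance to the reference vector.
        dist = sum((a - b) ** 2 for a, b in zip(reference, vec))
        if dist < best_dist:
            # A closer vector appeared: the stored value must be overwritten.
            best_dist, best_value = dist, value
    return best_value

# Hypothetical toy sequence: the second pair is closest to the reference.
sequence = [([3.0, 3.0], 0.1), ([1.0, 1.0], 0.7), ([2.0, 0.0], 0.4)]
answer = nearest_neighbor_value([0.0, 0.0], sequence)
```

A sequential model solving this has to replace what it stored at step 1 once step 2 arrives, and then keep it at step 3, which is the behavior the exponential input gate is meant to support.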

Key Novelty

xLSTM (Extended LSTM) with sLSTM and mLSTM blocks

Introduces exponential gating (replacing sigmoid) to allow sharper focus and the ability to revise storage decisions (exponential decay/update)
Expands memory from scalar to matrix form (mLSTM) using a covariance update rule (adding the outer product of value and key vectors), enabling high-capacity storage similar to key-value pairs in Transformers
Eliminates hidden-hidden connections in the matrix memory variant (mLSTM) to enable fully parallelizable training via matrix operations
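
As a compact sketch of how these three points combine, the mLSTM recurrence can be written as below. The notation is an assumption of this summary: q_t, k_t, v_t are query, key, and value vectors, n_t is a normalizer state, o_t is the output gate, and tildes mark gate pre-activations. It should be checked against the paper's equations before being relied on.

```latex
\begin{aligned}
C_t &= f_t\, C_{t-1} + i_t\, v_t k_t^{\top} \\
n_t &= f_t\, n_{t-1} + i_t\, k_t \\
\tilde{h}_t &= \frac{C_t q_t}{\max\{\,|n_t^{\top} q_t|,\ 1\,\}} \\
h_t &= o_t \odot \tilde{h}_t \\
i_t &= \exp(\tilde{i}_t), \qquad f_t = \sigma(\tilde{f}_t)\ \text{or}\ \exp(\tilde{f}_t)
\end{aligned}
```

None of the right-hand sides depends on h_{t-1}, which is what allows all time steps to be computed with matrix operations during training.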

Architecture

Evolution from original LSTM to xLSTM architecture. Shows the original cell, the new sLSTM and mLSTM cells, the residual blocks, and the stacked architecture.

Evaluation Highlights

xLSTM[1:0] achieves 13.43 validation perplexity on SlimPajama (15B tokens), outperforming Mamba (13.70) and Llama (14.25) at comparable ~400M parameter sizes
In length extrapolation tests (training on 2k, testing up to 16k), 1.3B xLSTM models maintain low perplexity (~8.92-9.01) while Llama degrades significantly (337.83)
xLSTM[1:0] achieves lower perplexity than Mamba on 568 out of 571 (99.5%) text domains in the PALOMA benchmark

Breakthrough Assessment

9/10

Successfully modernizes the LSTM to be parallelizable and scalable, outperforming strong baselines like Mamba and Llama on language modeling tasks while retaining linear complexity.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling (next token prediction) and synthetic sequence tasks requiring state tracking and associative recall

Inputs: Sequence of tokens x_t

Outputs: Predicted probability distribution for next token x_{t+1}

Pipeline Flow

Input embedding
Stack of xLSTM Residual Blocks (mixing sLSTM and mLSTM layers)
Output projection (Head)

System Modules

mLSTM Block

High-capacity storage and retrieval using matrix memory

Model or implementation: Pre up-projection block with LayerNorm, Convolution, and mLSTM cell

sLSTM Block

State tracking and control flow via memory mixing

Model or implementation: Post up-projection block with LayerNorm, sLSTM cell, and Gated MLP

Novel Architectural Elements

Exponential gating with stabilizer states (log-space max tracking) to prevent overflow
Matrix memory cell (C_t) replacing scalar cell (c_t) in mLSTM
Hybrid architecture mixing sLSTM (scalar, recurrent) and mLSTM (matrix, parallel) blocks
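
The stabilizer idea can be sketched for a single scalar cell as follows. This is a minimal illustration under assumptions, not the reference implementation: the function names `stabilized_step` and `run_cell` are hypothetical, the output gate is omitted, and the gate pre-activations are passed in directly rather than computed from inputs and hidden states.

```python
import math

def stabilized_step(c, n, m, i_pre, f_pre, z):
    """One scalar-cell update with exponential input and forget gates.

    c, n, m are the cell, normalizer, and stabilizer states; i_pre and f_pre
    are gate pre-activations (the raw gates would be exp(i_pre), exp(f_pre));
    z is the candidate cell input.
    """
    # Stabilizer: running maximum of the log-space gate magnitudes.
    m_new = max(f_pre + m, i_pre)
    # Subtracting m_new keeps both exponents <= 0, so exp() cannot overflow.
    i_gate = math.exp(i_pre - m_new)
    f_gate = math.exp(f_pre + m - m_new)
    c_new = f_gate * c + i_gate * z
    n_new = f_gate * n + i_gate
    return c_new, n_new, m_new

def run_cell(steps):
    """Run the cell over (i_pre, f_pre, z) triples and return the readout c / n."""
    c, n, m = 0.0, 0.0, 0.0
    for i_pre, f_pre, z in steps:
        c, n, m = stabilized_step(c, n, m, i_pre, f_pre, z)
    return c / n

# A very large input-gate pre-activation at step 2 overwrites the value from step 1.
# The unstabilized gate math.exp(1000.0) would raise OverflowError in Python.
readout = run_cell([(0.0, 0.0, 1.0), (1000.0, 0.0, 5.0)])
```

Because c and n are scaled by the same factor exp(-m), the ratio c / n is unchanged by the stabilization, so the sketch only alters numerical range and not the readout.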

Modeling

Base Model: xLSTM architecture (stacked residual blocks)

Trainable Parameters: Up to 1.3B parameters in reported experiments

Training Data:

SlimPajama (15B tokens for initial comparison)
SlimPajama (300B tokens for LLM experiments)

Key Hyperparameters:

context_length: 2048
optimizer: AdamW
scheduler: Cosine decay
precision: Mixed precision (bfloat16)
learning_rate: Not reported in the paper
batch_size: Not reported in the paper

Compute: Trained on GPUs (specific count and type not explicitly listed for training; A100-80GB is mentioned for inference speed tests)

Comparison to Prior Work

vs. Mamba: xLSTM retains explicit memory cells and introduces matrix memory; sLSTM keeps memory mixing (recurrence) for state tracking which pure SSMs lack
vs. RWKV: RWKV uses time-mixing/channel-mixing split; xLSTM uses matrix memory with covariance update and exponential gating explicitly derived from LSTM
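vs. Transformers: xLSTM has linear complexity O(N) in sequence length and a constant-size inference state, whereas Transformers have O(N^2) complexity and a key-value cache that grows with context length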
vs. HGRN2 [not cited in paper]: HGRN2 also uses gating and recurrence but lacks the specific matrix covariance update derived to generalize LSTM cells
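
The difference in inference memory between a Transformer and a matrix-memory recurrent model can be illustrated with a rough count of stored numbers. This is a back-of-the-envelope sketch: the model shape is hypothetical, the function names are made up for illustration, and the recurrent count includes only a matrix C and a normalizer vector per head, ignoring any other per-layer state.

```python
def kv_cache_entries(seq_len, n_layers, n_heads, head_dim):
    """Stored numbers in a Transformer key-value cache: grows with seq_len."""
    return 2 * seq_len * n_layers * n_heads * head_dim  # 2 = keys + values

def matrix_memory_entries(n_layers, n_heads, head_dim):
    """Stored numbers in a matrix-memory recurrent state: independent of seq_len.

    Counts one head_dim x head_dim matrix C and one head_dim normalizer per head.
    """
    return n_layers * n_heads * (head_dim * head_dim + head_dim)

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=24, n_heads=4, head_dim=256)
for seq_len in (2048, 16384):
    print(seq_len, kv_cache_entries(seq_len, **shape), matrix_memory_entries(**shape))
```

Going from 2048 to 16384 tokens multiplies the cache count by 8 while the recurrent state count stays fixed, which is the sense in which the inference state is constant.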

Limitations

sLSTM component is not parallelizable due to memory mixing (hidden-hidden connections), limiting training speed compared to pure parallel models
Matrix memory operations in mLSTM (d x d) are computationally expensive compared to scalar updates
Initialization of exponential forget gates requires careful tuning to avoid instability
Optimization of hyperparameters for large-scale runs was limited by compute resources

Reproducibility

Code: https://github.com/NX-AI/xlstm

The paper details the exact update equations and block structures. Hyperparameters for the 300B-token runs are noted as not fully optimized due to compute constraints.

📊 Experiments & Results

Evaluation Setup

Language modeling on SlimPajama; Synthetic tasks for state tracking/associative recall; Downstream common sense reasoning tasks

Benchmarks:

SlimPajama (Next-token prediction, language modeling)
Multi-Query Associative Recall (Synthetic memory retrieval)
Formal Languages (Chomsky Hierarchy) (State tracking / Grammar recognition)
PALOMA (Language modeling across 571 text domains)
Long Range Arena (LRA) (Long sequence processing)

Metrics:

Perplexity (PPL)
Accuracy (for synthetic/downstream tasks)
Tokens per second (Throughput)
Statistical methodology: Not explicitly reported in the paper
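
Perplexity, the main metric in the results, is conventionally the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming natural-log token probabilities are already available (the function name `perplexity` and the toy inputs are illustrative, and the paper's exact evaluation protocol may differ):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every correct token has perplexity 4.
uniform_four = perplexity([math.log(0.25)] * 3)
```

Lower is better, so a negative Δ on a perplexity row is an improvement while a positive Δ on an accuracy row is an improvement.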

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparisons on the 15B token scale show xLSTM outperforming all baselines in validation perplexity.
SlimPajama Validation	Perplexity	13.70 (Mamba)	13.43	-0.27
SlimPajama Validation	Perplexity	14.25 (Llama)	13.43	-0.82
Large scale (300B token) experiments demonstrate scaling behavior and superiority on downstream tasks.
SlimPajama Validation	Perplexity	9.14	8.89	-0.25
HellaSwag	Accuracy	57.44	57.83	+0.39
SlimPajama (16k context)	Perplexity	337.83 (Llama)	8.92	-328.91
Synthetic tasks highlight specific architectural advantages in state tracking and memory.
Parity (Regular Language)	Accuracy	0.51	1.0	+0.49
Multi-Query Associative Recall (256 pairs)	Accuracy	0.18	1.0	+0.82

Main Takeaways

xLSTM outperforms state-of-the-art SSMs (Mamba) and Transformers (Llama) on language modeling perplexity at equivalent scales.
The combination of matrix memory (mLSTM) and exponential gating solves the storage capacity and revision limitations of original LSTMs.
xLSTM exhibits strong length extrapolation capabilities, maintaining performance on contexts 8x longer than those seen in training, without fine-tuning.
The architecture is effective at state-tracking tasks (Parity) where pure SSMs fail, due to the memory mixing in sLSTM blocks.
Inference throughput is significantly higher than that of Transformers at large batch sizes, because the recurrent state stays constant in size as the sequence grows.

📚 Prerequisite Knowledge

Prerequisites

Recurrent Neural Networks (RNN) and LSTM fundamentals
Transformer architecture (Key-Value-Query mechanisms)
State Space Models (SSM) basics

Key Terms

sLSTM: Scalar LSTM—an updated LSTM with exponential gating and a normalizer state, permitting memory mixing but remaining sequential

mLSTM: Matrix LSTM—a variant using a matrix memory state updated via an outer product rule (covariance update), which is fully parallelizable due to lack of hidden-hidden mixing

exponential gating: Using exp() instead of sigmoid activation for input/forget gates, allowing the model to more aggressively revise or preserve memory states

covariance update rule: An update mechanism where the memory matrix is modified by adding the outer product of a value vector and a key vector (C_t = C_{t-1} + v_t k_t^T)
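
The store-and-retrieve behavior of this rule can be sketched in a few lines of plain Python. The helper names and the toy keys and values are assumptions for illustration; the keys are chosen orthonormal so that retrieval is exact, and gating and normalization are left out.

```python
def outer(v, k):
    """Outer product v k^T as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]

def mat_add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(c, q):
    """Matrix-vector product C q."""
    return [sum(cij * qj for cij, qj in zip(row, q)) for row in c]

# Two key-value pairs with orthonormal keys (toy data).
k1, v1 = [1.0, 0.0], [3.0, 4.0]
k2, v2 = [0.0, 1.0], [-1.0, 2.0]

C = [[0.0, 0.0], [0.0, 0.0]]
for k, v in [(k1, v1), (k2, v2)]:
    C = mat_add(C, outer(v, k))  # C_t = C_{t-1} + v_t k_t^T

retrieved_1 = matvec(C, k1)  # querying with k1 recovers v1
retrieved_2 = matvec(C, k2)  # querying with k2 recovers v2
```

With keys that are not orthogonal, each retrieval would also pick up a weighted contribution from the other stored values, which is one reason the query-key dimension matters for capacity.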

memory mixing: The interaction between hidden states from different memory cells (or heads) via recurrent weight matrices, crucial for state tracking

xLSTM block: A residual block wrapping either an sLSTM (with post up-projection) or mLSTM (with pre up-projection) into a standard deep learning backbone

BAM: Bidirectional Associative Memory—a type of recurrent network that stores pairs of vectors (keys and values) using correlation matrices

SlimPajama: A large-scale, deduplicated dataset for training large language models, derived from the RedPajama dataset

pre up-projection: A block design where inputs are projected to a high dimension *before* the core mixing/memory operation (used in mLSTM blocks)

post up-projection: A block design where the core operation happens in lower dimension, followed by projection to high dimension and back (used in sLSTM blocks, similar to Transformer FFNs)

FlashAttention: An algorithm that speeds up attention computation and reduces memory usage by optimizing GPU memory reads/writes (referenced here for parallel comparison)

state tracking: The ability of a model to maintain and update the status of entities or variables over time, often required for formal language tasks like parity or Dyck languages
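
Parity is the simplest example of such a task: the answer after each token depends on the previous state as well as the current input. A minimal sketch of the target computation (the function name `running_parity` is illustrative):

```python
def running_parity(bits):
    """Return 1 if the number of ones seen is odd, else 0."""
    state = 0
    for b in bits:
        # The new state is a function of the old state and the current input.
        state ^= b
    return state

odd_ones = running_parity([1, 0, 1, 1])   # three ones
even_ones = running_parity([1, 1, 0, 0])  # two ones
```

A model has to carry this one bit of state across arbitrarily long inputs, which is the kind of computation the summary attributes to memory mixing in sLSTM blocks.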