Memory Layers at Scale - Paper Summary

📝 Paper Summary

Memory organization, Sparse neural networks

Replacing dense feed-forward layers with large-scale, sparsely activated product-key memory layers significantly increases model capacity and factual accuracy without increasing inference FLOPs.

Core Problem

Dense language models couple parameter count directly with computational cost, making it expensive to scale storage for simple associations (facts) using standard feed-forward networks.

Why it matters:

Scaling dense models to store more facts requires prohibitive increases in compute and energy
Memory-bandwidth bound components like sparse memory layers have been underutilized and unoptimized for modern hardware compared to FLOP-bound dense layers
Current alternatives like Mixture-of-Experts (MoE) still resemble dense networks and don't maximize parameter efficiency for pure storage

Concrete Example: A standard dense LLM struggles to recall specific long-tail facts (e.g., a celebrity's birthday) unless scaled to massive sizes, whereas a memory-augmented model can retrieve this from a dedicated sparse layer without activating billions of parameters.

Key Novelty

Scalable Product-Key Memory Layers

Replaces feed-forward network (FFN) layers with a sparse key-value lookup mechanism where keys and values are trainable parameters
Uses product quantization for keys (splitting keys into two sub-keys) to enable efficient top-k retrieval over millions of entries without prohibitive search costs
Implements custom CUDA kernels to overcome PyTorch bandwidth bottlenecks, enabling scaling to 128 billion parameters with high throughput
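The saving from splitting keys can be illustrated with a rough count of score computations. This is only a sketch: `n_sub` (the size of each sub-key set) and `top_k` are illustrative values chosen here, not the paper's configuration.

```python
def lookup_cost(n_sub: int, top_k: int) -> dict:
    """Count score computations for a flat search vs. a product-key search.

    n_sub and top_k are illustrative parameters, not values from the paper.
    """
    n_slots = n_sub * n_sub        # every (i, j) sub-key pair addresses one slot
    flat = n_slots                 # a flat search scores every full key
    sub_scores = 2 * n_sub         # one pass over each of the two sub-key sets
    pair_sums = top_k * top_k      # combine the two top-k lists into candidates
    return {"slots": n_slots, "flat": flat, "product": sub_scores + pair_sums}


cost = lookup_cost(n_sub=1024, top_k=32)
print(cost)
```

With these assumed sizes, about a million slots are addressed with a few thousand score computations, which is why the lookup itself adds little compute.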

Evaluation Highlights

+100% improvement in factual accuracy on QA benchmarks compared to dense baselines
Outperforms dense models trained with >2x the compute budget on downstream tasks
Surpasses Mixture-of-Experts (MoE) models when matched for compute and parameter count, particularly on factual tasks

Breakthrough Assessment

8/10

Demonstrates that memory layers can be scaled to 100B+ parameters with real hardware acceleration, making them a viable alternative to MoE for adding capacity without adding FLOPs.

⚙️ Technical Details

Problem Definition

Setting: Language modeling with augmented capacity for factual storage

Inputs: Token embedding from previous attention layer

Outputs: Updated token embedding after memory lookup and aggregation

Pipeline Flow

Transformer Attention Layer
Memory Layer (replacing FFN)
Next Transformer Layer

System Modules

Product-Key Lookup (Memory Retrieval)

Identify top-k relevant memory slots efficiently

Model or implementation: Trainable product keys (K1, K2)
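A minimal sketch of the product-key lookup, assuming the query is split into two halves and slot (i, j) is scored as the sum of the two half scores. The function names, the tiny sizes, and the random keys are hypothetical; a real layer would run this batched on GPU.

```python
import heapq
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def product_key_topk(query, keys1, keys2, k):
    """Return the k best (score, slot_index) pairs, best first."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    s1 = [dot(q1, key) for key in keys1]   # scores against sub-key set K1
    s2 = [dot(q2, key) for key in keys2]   # scores against sub-key set K2
    # Top-k within each half; the global top-k lies inside their product.
    top1 = heapq.nlargest(k, range(len(s1)), key=s1.__getitem__)
    top2 = heapq.nlargest(k, range(len(s2)), key=s2.__getitem__)
    n2 = len(keys2)
    candidates = [(s1[i] + s2[j], i * n2 + j) for i in top1 for j in top2]
    return heapq.nlargest(k, candidates)


random.seed(0)
n, half_dim, k = 16, 4, 4                  # illustrative sizes only
keys1 = [[random.gauss(0, 1) for _ in range(half_dim)] for _ in range(n)]
keys2 = [[random.gauss(0, 1) for _ in range(half_dim)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(2 * half_dim)]
top = product_key_topk(query, keys1, keys2, k)
print([slot for _, slot in top])
```

Only 2n sub-key scores and k*k candidate sums are computed, yet the result matches a search over all n*n slots.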

Value Aggregation (Memory Retrieval)

Retrieve and combine values associated with top-k keys

Model or implementation: Trainable value embeddings (sharded across GPUs)
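The aggregation step can be sketched as a softmax over the selected scores followed by a weighted sum of the matching value rows. The softmax weighting is an assumption about the exact form, and the tiny value table is made up for illustration; sharding across GPUs is omitted.

```python
import math


def aggregate_values(scores, indices, values):
    """Softmax the top-k scores and return the weighted sum of their values."""
    peak = max(scores)                         # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [0.0] * dim
    for w, idx in zip(weights, indices):
        row = values[idx]                      # only k rows are ever read
        for d in range(dim):
            out[d] += w * row[d]
    return out, weights


values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
out, weights = aggregate_values(scores=[0.0, 0.0], indices=[0, 1], values=values)
print(out, weights)
```

Because only the selected rows are touched, the cost depends on k and the value dimension, not on the size of the table.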

Gating Mechanism

Modulate the memory output before adding to residual stream

Model or implementation: SiLU activation
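One plausible form of the gate, sketched under the assumption that the memory output y is multiplied elementwise by silu(x W1) and then projected by W2. The matrix names `w1` and `w2` and the identity weights are illustrative, not taken from the released code.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))            # same as x * sigmoid(x)


def matvec(vec, matrix):
    """Row vector (d_in) times matrix (d_in x d_out)."""
    return [sum(v * row[c] for v, row in zip(vec, matrix))
            for c in range(len(matrix[0]))]


def gated_output(x, y, w1, w2):
    """Gate memory output y with the layer input x, then project."""
    gate = [silu(g) for g in matvec(x, w1)]
    gated = [yi * gi for yi, gi in zip(y, gate)]
    return matvec(gated, w2)


identity = [[1.0, 0.0], [0.0, 1.0]]
out_gated = gated_output(x=[1.0, -1.0], y=[2.0, 3.0], w1=identity, w2=identity)
print(out_gated)
```

The gate lets the input token decide how much of the retrieved memory reaches the residual stream.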

Novel Architectural Elements

Replacement of FFN with Product-Key Memory Layer at scale (up to 128B params)
Shared memory pool across multiple transformer layers (parameter sharing)
Custom CUDA kernels for the EmbeddingBag operation to maximize memory bandwidth utilization (about 3 TB/s vs. 400 GB/s for the PyTorch baseline)
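Parameter sharing across layers can be pictured as several layers holding a reference to one memory pool, so its parameters are counted once. The class, the layer indices, and the sizes below are hypothetical and serve only to show the bookkeeping.

```python
class MemoryPool:
    """Stand-in for one shared set of memory values (hypothetical container)."""

    def __init__(self, n_slots, dim):
        self.n_slots = n_slots
        self.dim = dim

    def num_params(self):
        return self.n_slots * self.dim         # value table only; keys omitted


def build_layers(n_layers, memory_layers, pool):
    """Every layer keeps attention; chosen layers swap the FFN for the pool."""
    return [
        {"index": i,
         "ffn": "memory" if i in memory_layers else "dense",
         "pool": pool if i in memory_layers else None}
        for i in range(n_layers)
    ]


def unique_memory_params(layers):
    """Count each distinct pool once, however many layers reference it."""
    pools = {id(layer["pool"]): layer["pool"]
             for layer in layers if layer["pool"] is not None}
    return sum(p.num_params() for p in pools.values())


pool = MemoryPool(n_slots=1024 * 1024, dim=64)
layers = build_layers(n_layers=12, memory_layers={3, 7, 11}, pool=pool)
print(unique_memory_params(layers))
```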

Modeling

Base Model: Llama-2 architecture (134M to 1.3B) and Llama-3 (8B)

Trainable Parameters: Memory keys and values are end-to-end trainable alongside base model

Training Data:

Pretraining data mix similar to Llama 2 (for smaller models)
Optimized data mix similar to Llama 3 (for 8B model)
Trained to 1T tokens

Key Hyperparameters:

max_context_length: 32k (Llama 2 base), 128k (Llama 3 base)
memory_capacity: Up to 128 billion parameters
token_count: 1 trillion tokens

Compute: Trainable memory layers optimized to reach 3TB/s memory bandwidth on H100 GPUs
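Since the layer is bandwidth bound, a back-of-the-envelope estimate of the gather time follows from bytes read divided by bandwidth. The workload sizes below (token count, top-k, value dimension, 2-byte elements) are assumptions for illustration; only the 3 TB/s and 400 GB/s figures come from this summary.

```python
def gather_time_us(tokens, top_k, value_dim, bytes_per_elem, bandwidth_gb_s):
    """Microseconds needed to read the selected value rows from GPU memory."""
    bytes_read = tokens * top_k * value_dim * bytes_per_elem
    return bytes_read / (bandwidth_gb_s * 1e9) * 1e6


workload = dict(tokens=8192, top_k=32, value_dim=1024, bytes_per_elem=2)
fast = gather_time_us(**workload, bandwidth_gb_s=3000)   # optimized kernels
slow = gather_time_us(**workload, bandwidth_gb_s=400)    # baseline figure
print(f"{fast:.1f} us vs {slow:.1f} us, speedup {slow / fast:.1f}x")
```

The speedup is simply the bandwidth ratio, which is why kernel efficiency matters more here than FLOP throughput.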

Comparison to Prior Work

vs. MoE: Memory layers are purely key-value lookups without FFN computation in the sparse branch; Memory layers outperform MoE on factual tasks at matched compute/params.
vs. PEER: PEER retrieves rank-1 matrices; this work focuses on scaling simpler vector value retrieval to massive scales (128B).
vs. PKM (product-key memory): This work scales to 128B params, introduces shared memory across layers, improved gating, and custom kernels for bandwidth saturation.

Limitations

Memory layers are memory-bandwidth bound, requiring custom kernels for efficiency
Gains diminish after replacing more than ~3 FFN layers, suggesting dense and sparse layers are complementary
Training stability can be an issue for large memory layers with small base models (requires QK-norm)
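QK-normalization can be sketched as normalizing the query and the key before taking their dot product, which bounds the score regardless of how large the raw vectors grow. The RMS form without a learned scale is an assumption; the summary does not state which normalization is used.

```python
import math


def rms_normalize(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def normalized_score(query, key):
    """Dot product of the normalized query and key."""
    q = rms_normalize(query)
    k = rms_normalize(key)
    return sum(a * b for a, b in zip(q, k))


small = normalized_score([1.0, 2.0], [1.0, 2.0])
large = normalized_score([100.0, 200.0], [1.0, 2.0])
print(small, large)
```

Scaling the query by 100 leaves the score essentially unchanged, so the softmax over scores cannot saturate just because activations grow during training.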

Reproducibility

Code: https://github.com/facebookresearch/memory

Training data mixes are described as similar to Llama 2/3, but the exact datasets are not provided. The custom CUDA kernels, which are crucial for replication, are included in the repo.

📊 Experiments & Results

Evaluation Setup

Pretraining followed by zero-shot or few-shot evaluation on downstream tasks

Benchmarks:

NaturalQuestions (Factual QA)
TriviaQA (Factual QA)
HotpotQA (Multi-hop QA)
MMLU (General Knowledge)
HumanEval (Coding)

Metrics:

Exact Match (EM)
F1 score
pass@1
Negative Log-Likelihood (NLL)
Statistical methodology: Not explicitly reported in the paper
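For reference, Exact Match and token-level F1 for QA are commonly computed as below. This sketch assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles); the summary does not confirm which normalization the paper's evaluation uses.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def f1_score(pred, gold):
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```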

Key Results

Memory layers significantly outperform dense baselines on factual QA tasks.

Benchmark	Metric	Baseline	This Paper	Δ
NaturalQuestions	Exact Match	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Parallel memory lookup and aggregation across GPUs

Main Takeaways

Memory-augmented models outperform dense models trained with more than 2x the compute budget on downstream tasks.
Gains are most pronounced on factual tasks (NaturalQuestions, TriviaQA), confirming the hypothesis that memory layers store associations efficiently.
Memory layers outperform MoE models when matched for compute and parameter size, especially on factuality.
Scaling laws hold: performance improves with memory size up to 128B parameters.
Sharing memory parameters across multiple layers improves performance compared to single-layer memory, but replacing too many FFN layers degrades it.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Feed-Forward Networks)
Mixture of Experts (MoE)
Product Quantization / Approximate Nearest Neighbor Search
GPU Memory Bandwidth vs. FLOPs

Key Terms

Product-key memory: A memory retrieval mechanism where a large set of keys is represented as the cartesian product of two smaller sub-key sets, allowing efficient search.

EmbeddingBag: A PyTorch operation that computes sums or means of lookup table entries; optimized here with custom CUDA kernels.

SiLU: Sigmoid Linear Unit—an activation function x * sigmoid(x) used for gating the memory output.

QK-normalization: A technique to normalize queries and keys to improve training stability in large-scale attention or memory layers.

Mixture-of-Experts (MoE): A sparse architecture where different parts of the network (experts) are activated for different inputs.

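FLOPs: Floating-point operations, a count of arithmetic operations used as a measure of computational cost. Not to be confused with FLOPS (floating-point operations per second), which measures hardware throughput.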
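To make the decoupling of parameters from FLOPs concrete, the rough comparison below counts both for a dense gated FFN and a product-key memory. All sizes are hypothetical, and the memory estimate omits the query projection, gating projections, and multiple heads, so treat it as an order-of-magnitude sketch.

```python
def dense_ffn(d_model, d_ff):
    """Parameters and per-token FLOPs of a three-matrix gated FFN."""
    params = 3 * d_model * d_ff
    return params, 2 * params                  # one multiply-add per weight


def memory_layer(d_model, n_sub, top_k):
    """Parameters and approximate per-token lookup FLOPs of the memory."""
    half = d_model // 2
    key_params = 2 * n_sub * half              # two sub-key sets
    value_params = n_sub * n_sub * d_model     # one value row per slot
    flops = 2 * key_params + 2 * top_k * d_model   # score sub-keys, sum k rows
    return key_params + value_params, flops


dense_params, dense_flops = dense_ffn(d_model=1024, d_ff=4096)
mem_params, mem_flops = memory_layer(d_model=1024, n_sub=1024, top_k=32)
print(dense_params, dense_flops)
print(mem_params, mem_flops)
```

Under these assumptions the memory holds far more parameters than the FFN while its lookup needs fewer FLOPs per token.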