Memory Layers at Scale

Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh
Meta Fundamental AI Research
arXiv
Memory Pretraining Factuality QA

📝 Paper Summary

Memory organization, Sparse neural networks
Replacing dense feed-forward layers with large-scale, sparsely activated product-key memory layers significantly increases model capacity and factual accuracy without increasing inference FLOPs.
Core Problem
Dense language models couple parameter count directly with computational cost, making it expensive to scale storage for simple associations (facts) using standard feed-forward networks.
Why it matters:
  • Scaling dense models to store more facts requires prohibitive increases in compute and energy
  • Memory-bandwidth-bound components such as sparse memory layers have been underused and poorly optimized on modern hardware compared with FLOP-bound dense layers
  • Current alternatives like Mixture-of-Experts (MoE) still resemble dense networks and don't maximize parameter efficiency for pure storage
Concrete Example: A standard dense LLM struggles to recall specific long-tail facts (e.g., a celebrity's birthday) unless scaled to massive sizes, whereas a memory-augmented model can retrieve this from a dedicated sparse layer without activating billions of parameters.
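The coupling between parameters and compute can be made concrete with rough arithmetic. The sketch below is only an illustration under stated assumptions: the function names and the cost formulas are hypothetical simplifications (one multiply-add per weight touched, keys split into two sub-key sets of size sqrt(N) as described under Key Novelty), not figures from the paper.

```python
import math

def dense_ffn_cost(d_model, d_ff):
    """Parameters and approximate per-token FLOPs of a two-matrix FFN (assumed form)."""
    params = 2 * d_model * d_ff          # up- and down-projection weights
    flops = 2 * params                   # one multiply-add per weight, every token
    return params, flops

def memory_layer_cost(num_entries, d_key, d_value, k):
    """Parameters and approximate per-token FLOPs of a sparse key-value lookup.

    Assumes num_entries is a perfect square and keys are split into two
    sub-key sets of sqrt(num_entries) rows each (hypothetical sizing).
    """
    n = math.isqrt(num_entries)
    params = num_entries * d_value + 2 * n * (d_key // 2)
    search = 2 * (2 * n * (d_key // 2))  # score the query halves against both sub-key sets
    combine = k * k                      # add scores for the k*k candidate pairs
    read = 2 * k * d_value               # weighted sum of the k selected value rows
    return params, search + combine + read

# Dense: FLOPs are always a fixed multiple of the parameter count.
dense_params, dense_flops = dense_ffn_cost(d_model=1024, d_ff=4096)

# Sparse: quadrupling the table size roughly doubles search cost only.
small = memory_layer_cost(num_entries=1024**2, d_key=512, d_value=1024, k=32)
large = memory_layer_cost(num_entries=2048**2, d_key=512, d_value=1024, k=32)
print("dense FLOPs per parameter:", dense_flops / dense_params)
print("memory params growth:", large[0] / small[0])
print("memory FLOPs growth:", large[1] / small[1])
```

Under these assumptions the dense layer pays for every parameter on every token, while the lookup layer's per-token cost grows with the square root of the table size rather than with the table size itself.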
Key Novelty
Scalable Product-Key Memory Layers
  • Replaces the feed-forward network (FFN) of one or more transformer layers with a sparse key-value lookup in which both keys and values are trainable parameters
  • Uses product-quantized keys (each key is split into two sub-keys) so that top-k retrieval over millions of entries scores only the two small sub-key sets instead of every full key
  • Implements custom CUDA kernels to overcome memory-bandwidth bottlenecks in the stock PyTorch operations, enabling scaling to 128 billion memory parameters with high throughput
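The product-key lookup in the bullets above can be sketched in a few lines of NumPy. This is a minimal single-query illustration, not the paper's implementation: the function name, the argument layout, the index convention `i * n + j`, and the plain softmax weighting are assumptions made for the example, and none of the custom-kernel optimizations are represented.

```python
import numpy as np

def product_key_lookup(query, sub_keys_1, sub_keys_2, values, k):
    """Sparse key-value lookup with product keys (illustrative sketch).

    query:      (d,) vector, split into two halves
    sub_keys_1: (n, d/2) sub-keys matched against the first half
    sub_keys_2: (n, d/2) sub-keys matched against the second half
    values:     (n*n, d_v) value table; row i*n + j pairs sub-key i with sub-key j
    """
    n = sub_keys_1.shape[0]
    half = query.shape[0] // 2
    q1, q2 = query[:half], query[half:]

    # Score each half against its own small sub-key set: 2*n dot products, not n*n.
    s1 = sub_keys_1 @ q1
    s2 = sub_keys_2 @ q2

    # Keep the k best sub-keys for each half.
    top1 = np.argsort(-s1)[:k]
    top2 = np.argsort(-s2)[:k]

    # Form the k*k candidate full keys; the score of pair (i, j) is s1[i] + s2[j].
    cand_scores = (s1[top1][:, None] + s2[top2][None, :]).ravel()
    cand_ids = (top1[:, None] * n + top2[None, :]).ravel()

    # Final top-k among the candidates.
    best = np.argsort(-cand_scores)[:k]
    ids, scores = cand_ids[best], cand_scores[best]

    # Softmax over the selected scores, then a weighted sum of the selected values.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[ids], ids, scores

# Hypothetical sizes: 16 * 16 = 256 memory entries, 8-dim queries, 4-dim values.
rng = np.random.default_rng(0)
n, d, d_v, k = 16, 8, 4, 3
sub_keys_1 = rng.normal(size=(n, d // 2))
sub_keys_2 = rng.normal(size=(n, d // 2))
values = rng.normal(size=(n * n, d_v))
output, ids, scores = product_key_lookup(rng.normal(size=d), sub_keys_1, sub_keys_2, values, k)
```

Because a full key's score is the sum of its two sub-key scores, the best k full keys always lie among the k-by-k candidates, so the result matches an exhaustive search over all n*n keys while touching only 2*n sub-keys.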
Evaluation Highlights
  • +100% improvement in factual accuracy on QA benchmarks compared to dense baselines
  • Outperforms dense models trained with more than twice the compute budget on downstream tasks
  • Surpasses Mixture-of-Experts (MoE) models when matched for compute and parameter count, particularly on factual tasks
Breakthrough Assessment
8/10
Demonstrates successful scaling of memory layers to 100B+ parameters with real hardware acceleration, showing that they are a viable alternative to MoE for adding capacity without adding FLOPs.