Memory Mosaics at scale - Paper Summary

📝 Paper Summary

Alternative Architectures to Transformers, In-Context Learning, Long-Context Modeling

Memory Mosaics v2 scales networks of associative memories to 8B parameters by introducing adaptive bandwidth and hierarchical memory, significantly outperforming Transformers on new-task adaptation and long-context retrieval.

Core Problem

Transformers exhibit in-context learning capabilities that are poorly understood and often degrade when given many examples (shots), while their reliance on position encoding limits context extrapolation.

Why it matters:

Transformers are opaque, making it difficult to understand how composition or disentanglement occurs during learning
Standard attention mechanisms struggle to extrapolate to longer context lengths without extensive fine-tuning
Current models often fail to effectively utilize large numbers of in-context examples, sometimes performing worse as more data is provided

Concrete Example: In a many-shot classification task (e.g., Banking77), a standard Transformer's accuracy actually decreases as the number of demonstration examples increases beyond a certain point. In contrast, Memory Mosaics v2 consistently improves accuracy with more shots.

Key Novelty

Memory Mosaics v2 (Hierarchical Associative Memory Network)

Replaces Transformer attention with 'Associative Memories' that use symmetric kernels and adaptive bandwidths, treating context as a key-value store without explicit position encoding
Separates memory into three distinct levels: Short-term (recent tokens), Long-term (distant tokens), and Persistent (global knowledge/FFN replacement) to handle different signal dependencies
Introduces a gated time-variant key extractor that creates input-dependent keys, unlike the fixed averaging used in previous versions
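The associative-memory read described above can be sketched as Gaussian kernel regression over stored key-value pairs. This is a minimal illustration, not the paper's implementation: the function name `retrieve`, the toy keys and values, and the fixed `beta` are all assumptions made for the example.

```python
import math

def retrieve(query, keys, values, beta=1.0):
    """Kernel-regression read: a weighted average of the stored values."""
    # Gaussian kernel score per stored key: -beta * ||query - key||^2
    scores = [-beta * sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys]
    # Normalise the scores with a numerically stable softmax
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

# Toy memory with two stored pairs (hypothetical data)
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
near_first = retrieve([0.9, 0.1], keys, values, beta=8.0)  # close to the first key
midpoint = retrieve([0.5, 0.5], keys, values, beta=8.0)    # equidistant from both keys
```

A query near the first key returns almost exactly the first value, while an equidistant query returns the average of the two. Nothing in the read depends on token position, which is consistent with the summary's point that no explicit position encoding is used.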

Architecture

The architecture of Memory Mosaics v2 compared to v1, detailing the split into Short-term and Long-term associative memories.

Evaluation Highlights

Outperforms Transformers by 12.3% to 14.8% on 'multi-unrelated-documents' QA tasks (Ruler benchmark) at 32k context length
Achieves >10% higher accuracy than Transformers on in-context classification tasks (Banking77, Tacred) and avoids performance degradation with more shots
Matches Transformer performance on 13 standard persistent-knowledge benchmarks (e.g., MMLU) while offering superior interpretability and context extrapolation

Breakthrough Assessment

8/10

Successfully scales a non-Transformer architecture to 8B parameters with competitive performance on standard tasks and superior performance on in-context/long-context tasks. Offers a transparent alternative to attention.

⚙️ Technical Details

Problem Definition

Setting: Language modeling and in-context learning where the model must predict next tokens based on retrieval from local context and global weights

Inputs: Sequence of tokens x (context window)

Outputs: Predicted next token probability distribution

Pipeline Flow

Gated Key Extractor (Generates keys from input tokens)
Memory Layer: Short-term Memory (Recent tokens) + Long-term Memory (Distant tokens)
Persistent Memory (Global knowledge retrieval via FFN)
Output Projection

System Modules

Gated Key Extractor

Generates query/key vectors from input tokens using input-dependent gating

Model or implementation: Recurrent-style gated averaging
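One way to picture "recurrent-style gated averaging" is a running average whose mixing weight depends on the current input. The sketch below is an assumption for illustration only: the sigmoid gate, the parameter names `gate_w` and `gate_b`, and the exact update rule are hypothetical and may differ from the paper's extractor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_keys(xs, gate_w, gate_b=0.0):
    """Return one key per step as an input-gated running average of the inputs."""
    keys, state = [], [0.0] * len(xs[0])
    for x in xs:
        # Input-dependent gate in (0, 1): how much of the current token enters the key
        g = sigmoid(sum(w * xi for w, xi in zip(gate_w, x)) + gate_b)
        # Convex mix of the previous state and the current input
        state = [(1.0 - g) * s + g * xi for s, xi in zip(state, x)]
        keys.append(state)
    return keys

# With zero gate weights the gate is constant at 0.5, i.e. a fixed average
demo_keys = gated_keys([[1.0, 0.0], [0.0, 1.0]], gate_w=[0.0, 0.0])
```

With non-zero `gate_w`, the averaging weight changes from token to token, which is the contrast with fixed averaging that the summary draws.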

Short-term Memory (Contextual Memory)

Stores and retrieves key-value pairs from the recent local context window

Model or implementation: Associative Memory (Gaussian Kernel)

Long-term Memory (Contextual Memory)

Stores and retrieves key-value pairs from the distant past context

Model or implementation: Associative Memory (Gaussian Kernel)

Persistent Memory

Stores global training knowledge shared across all sequences

Model or implementation: 2-layer Dense Network with SwiGLU
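For readers unfamiliar with SwiGLU, the following sketch shows the common Llama-style gated feed-forward computation. It is assumed here as a stand-in: the paper's persistent memory may differ in detail, and the weight names `w_gate`, `w_up`, and `w_down` are illustrative.

```python
import math

def silu(z):
    # Swish / SiLU activation: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Tiny example with identity projections (hypothetical weights)
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn([0.0, 2.0], identity, identity, [[1.0, 1.0]])
```

Unlike the contextual memories, this block reads nothing from the sequence: its "memory" lives entirely in the trained weights, which is why the summary calls it persistent.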

Novel Architectural Elements

Replacement of Attention with symmetric Associative Memory blocks using Gaussian kernels
3-Level Memory Hierarchy: Short-term (local), Long-term (distant), Persistent (FFN)
Adaptive Bandwidth: Kernel bandwidth beta scales automatically with the number of stored items n
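The adaptive bandwidth idea can be illustrated with a schedule in which beta grows with the number of stored items n. The summary only states that beta scales with n, so the power-law form and the parameters `beta1` and `alpha` below are assumptions chosen for illustration.

```python
def adaptive_beta(n, beta1=1.0, alpha=0.5):
    """Hypothetical schedule: beta = beta1 * n**alpha, so more stored items sharpen the kernel."""
    return beta1 * n ** alpha

# Bandwidth for memories of increasing size
schedule = {n: adaptive_beta(n) for n in (16, 256, 4096)}
```

The intuition from kernel regression is that a larger memory can afford a narrower kernel: with more stored pairs, close matches are more likely to exist, so less smoothing is needed.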

Modeling

Base Model: Memory Mosaics v2 Large (Llama-8B scale)

Training Method: Pretraining on token prediction followed by long-context fine-tuning

Objective Functions:

Purpose: Minimize prediction error.

Formally: Standard Next Token Prediction (Cross Entropy Loss)
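As a reminder of what the next-token objective computes, here is a minimal sketch of the averaged cross-entropy loss over a sequence. The function name and the toy logits are illustrative, not taken from the paper's code.

```python
import math

def next_token_loss(logits_per_step, targets):
    """Mean cross-entropy between predicted logits and the true next tokens."""
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        # Stable log-sum-exp gives the log of the softmax normaliser
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of the correct token
        total += log_z - logits[target]
    return total / len(targets)

# Uniform logits over a 4-token vocabulary: the loss equals log(4)
uniform_loss = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
```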

Adaptation: Fine-tuning on 32k context length

Trainable Parameters: 8 Billion

Training Data:

1 trillion tokens from a diverse real-world data mix

Key Hyperparameters:

layers: 32
hidden_dimensions: 4096
heads: 32
short_term_window_h: 256
long_term_delay_m_train: [64, 256]
long_term_delay_m_inference: 64
context_length_pretrain: 4096
context_length_finetune: 32768
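The window hyperparameters above suggest how the two contextual memories divide the past. The sketch below is an assumption about the indexing: the exact boundary conventions and the uniform sampling of the training delay are illustrative and not confirmed by the summary.

```python
import random

def short_term_positions(t, h=256):
    # Assumed convention: the h most recent positions strictly before t
    return range(max(0, t - h), t)

def long_term_positions(t, m=64):
    # Assumed convention: every position at least m steps before t
    return range(0, max(0, t - m + 1))

def sample_training_delay(rng, low=64, high=256):
    # Stochastic delay m during training (uniform sampling is an assumption)
    return rng.randint(low, high)

rng = random.Random(0)
delay = sample_training_delay(rng)
recent = short_term_positions(1000)        # positions 744..999
distant = long_term_positions(1000, m=64)  # positions 0..936
```

Under these conventions, with h = 256 and the inference delay m = 64, the two ranges overlap, so some positions are visible to both memories. Sampling m during training exposes the long-term memory to varying gaps, which fits the summary's observation that the stochastic delay helps context-length extrapolation.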

Compute: Not reported in the paper

Comparison to Prior Work

vs. Transformers: Uses symmetric kernel regression instead of Softmax attention; separates short/long memory explicitly; no position encodings
vs. RNNs/SSMs: Stores full key-value pairs (non-parametric) rather than compressing history into a fixed state, allowing perfect recall of 'needle-in-haystack' data
vs. Local Window Attention: Extrapolates to long contexts better (4x-8x) without fine-tuning due to lack of position encoding and adaptive bandwidth

Limitations

Per-token inference cost grows with sequence length (as with attention), unlike RNNs, whose per-token cost stays constant
Requires storage of all past key-value pairs (high memory footprint compared to SSMs)
Performance on some specific benchmarks (6 out of 19) degrades significantly if long-term memory is removed, indicating reliance on full context
No explicit position encoding means position information must be inferred from the short-term memory structure
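To give a rough sense of the memory-footprint limitation, the following back-of-the-envelope sketch counts bytes for stored key-value pairs. It assumes one full-width key and one full-width value per token per layer in 16-bit precision; the model's actual key and value dimensions and storage layout may differ, so the result is only an order-of-magnitude illustration.

```python
def kv_store_bytes(layers, tokens, width, bytes_per_value=2):
    # One key vector and one value vector of size `width` per token per layer
    return layers * tokens * 2 * width * bytes_per_value

# Hyperparameters from the summary: 32 layers, width 4096, 32k-token context
total = kv_store_bytes(layers=32, tokens=32768, width=4096)
gib = total / 2**30
```

Under these assumptions the stored pairs for a single 32k-token sequence occupy 16 GiB, whereas a fixed-state model such as an SSM keeps a state whose size does not grow with the number of tokens.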

Reproducibility

Code: https://github.com/facebookresearch/MemoryMosaics

Training data mixture details are not fully specified beyond 'diverse datamix'. Initialization values for the bandwidth hyperparameters are provided in the paper's appendix.

📊 Experiments & Results

Evaluation Setup

Pretraining on 1T tokens followed by evaluation on standard NLP benchmarks, long-context QA, and in-context classification

Benchmarks:

MMLU / Common Benchmarks (Persistent Knowledge Retrieval)
Ruler (multi-unrelated-documents QA, long context)
Banking77, Tacred, Goemotion (In-Context Classification)

Metrics:

Accuracy
Performance improvement (delta)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (Transformer)	This Paper (Memory Mosaics v2)	Δ
Performance on New Knowledge Storage (Long Context QA) tasks from the Ruler benchmark.
Ruler (4k context)	Accuracy Improvement	0.0	1.4	+1.4
Ruler (32k context)	Accuracy Improvement	0.0	12.3	+12.3
In-Context Learning (ICL) capability on classification tasks.
Banking77 / Tacred / Goemotion (Avg)	Accuracy Improvement	0.0	10.0	+10.0
Persistent Knowledge capabilities on standard benchmarks.
13 Common NLP Benchmarks	Accuracy	Comparable	Comparable	0.0

Experiment Figures

Comparison of In-Context Learning performance (Accuracy vs. Number of Shots) on semantic classification tasks.

Main Takeaways

Memory Mosaics v2 matches Transformers on standard knowledge tasks but significantly outperforms them on tasks requiring new knowledge adaptation (ICL) and long context.
In-context learning performance monotonically improves with more examples (shots) for Memory Mosaics, whereas Transformers often degrade with more shots.
The stochastic long-term memory delay during training is crucial, improving context-length extrapolation by over 15%.
The 'Short-term' vs 'Long-term' memory split naturally aligns with position-dependent vs position-invariant signals, offering interpretability.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (Attention, FFN)
Basic knowledge of Kernel Regression / Density Estimation
Familiarity with In-Context Learning (ICL) concepts

Key Terms

Associative Memory: A component that stores key-value pairs and retrieves values via kernel regression (weighted sum based on key similarity), replacing Attention

Memory Mosaics: A neural architecture constructed from networks of associative memories instead of attention heads

Kernel Regression: A non-parametric statistical method used here for memory retrieval, estimating values by smoothing over stored examples with a bandwidth parameter

Bandwidth: A parameter (beta) in the Gaussian kernel that controls how sharp or broad the retrieval focus is; a larger beta gives a sharper focus, so it plays a role analogous to an inverse temperature in softmax

Induction Head: A mechanism where a model learns to copy the token that followed a specific pattern in the past (e.g., if A follows B, predict A next time B appears)

Persistent Memory: The component in Memory Mosaics replacing the Feed-Forward Network (FFN), representing global static knowledge stored in weights

SwiGLU: Swish-Gated Linear Unit, a widely used activation function in modern LLMs (like Llama) for feed-forward layers

Ruler benchmark: A benchmark for evaluating long-context models using tasks with high information entropy, such as 'needle-in-a-haystack' variations