$\text{Memory}^3$: Language Modeling with Explicit Memory

📝 Paper Summary

Memory internalization Sparse memory QA Retrieval

Memory3 introduces explicit memory—sparse attention key-values retrieved during inference—to externalize knowledge from model parameters, enabling smaller models to outperform larger ones with higher speed and lower cost.

Core Problem

LLMs suffer from high training/inference costs due to scaling laws and 'knowledge traversal', where the entire massive parameter set is activated to access small pieces of specific knowledge.

Why it matters:

Current LLMs have extremely low knowledge efficiency (estimated around 10^-5), wastefully invoking all parameters for every token generation
Storing all knowledge in implicit memory (parameters) forces models to be massive, increasing training data requirements and energy consumption
Standard RAG approaches rely on heavy backbones and don't solve the knowledge traversal issue of the underlying model

Concrete Example: When a human writes a word, they don't recall every book they've ever read; however, an LLM activates all its parameters (its full 'brain') for every token. This is like forcing knowledge into muscle memory (implicit) rather than just recalling a book (explicit).

Key Novelty

Externalizing knowledge into 'Explicit Memory' (sparse attention key-values)

Treats knowledge storage as a hierarchy: RAG (text) → Explicit Memory (sparse KV pairs) → Implicit Memory (parameters), optimizing for read/write costs
Converts text data into sparse key-value pairs off-line, which are then retrieved and injected directly into the model's self-attention layers during inference
Uses a two-stage pretraining scheme: first training the model to use memory (Auto-Encoding), then training it to generate text (Auto-Regressive)

Architecture

The workflow of Memory3, contrasting it with RAG and Parametric models. It shows the offline conversion of text to Explicit Memory and the online retrieval process.

Evaluation Highlights

Memory3-2.4B outperforms the larger Baichuan2-7B-Base on the C-Eval benchmark (56.0 vs 54.0) despite having ~3x fewer parameters
Achieves 1.66x higher decoding speed compared to a RAG baseline (Baichuan2-7B + Faiss retrieval)
Reduces hallucination rate significantly in professional domains (e.g., medical), achieving 70.8% accuracy on unrecalled medical questions vs. 32.0% for the baseline

Breakthrough Assessment

8/10

Proposes a fundamental architectural shift by externalizing knowledge from model parameters into retrievable KV memory. The 2.4B model outperforming a 7B baseline on C-Eval supports the efficiency of this 'third form' of memory, although it trails the baseline on MMLU.

⚙️ Technical Details

Problem Definition

Setting: Language modeling where knowledge is stored in an externalized, retrievable format rather than solely in model weights

Inputs: Input context sequence

Outputs: Next token prediction probability distribution

Pipeline Flow

Memory Creation: Knowledge Base → Encoder → Sparse KV Memories (Offline)
Inference: Input → Retriever (MIPS) → Top-k Explicit Memories → Attention Layers → Output
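
The split between the offline and online phases above can be sketched as a toy program. This is a minimal illustration under stated assumptions, not the paper's implementation: `encode`, `build_memory_bank`, and `retrieve` are hypothetical names, a bag-of-words counter stands in for the real encoder, and the stored item is the raw text rather than sparse key-values.

```python
from collections import Counter

def encode(text):
    """Hypothetical stand-in encoder: bag-of-words counts as a sparse vector."""
    return Counter(text.lower().split())

def inner_product(a, b):
    """Inner product of two sparse vectors stored as Counters."""
    return sum(a[t] * b[t] for t in a if t in b)

def build_memory_bank(knowledge_base):
    """Offline step: encode every reference once and store the result."""
    return [(encode(doc), doc) for doc in knowledge_base]

def retrieve(memory_bank, query, k=2):
    """Online step: return the k stored items with the largest inner product."""
    q = encode(query)
    ranked = sorted(memory_bank, key=lambda item: inner_product(q, item[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

knowledge_base = [
    "aspirin reduces fever and pain",
    "the transformer uses self attention",
    "faiss performs similarity search",
]
bank = build_memory_bank(knowledge_base)          # done once, offline
top = retrieve(bank, "what reduces fever", k=1)   # done per query, online
```

The point of the split is that the expensive encoding runs once per reference, while each query pays only for the search.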

System Modules

Memory Encoder

Converts raw text into explicit memory format (sparse key-values)

Model or implementation: Memory3-2.4B (same backbone used for inference)

Retriever (Inference)

Selects relevant explicit memories for the current context

Model or implementation: Faiss (MIPS index)
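
To make the retrieval criterion concrete, here is an exhaustive MIPS sketch in plain Python. It is not how Faiss works internally and the vectors are made up; `mips_top_k` is a hypothetical helper. In Faiss itself, an exact inner-product index is available as `IndexFlatIP`, though which index type the paper uses is not stated in this summary.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def mips_top_k(query, database, k):
    """Exhaustive maximum inner product search: indices of the k best vectors."""
    order = sorted(range(len(database)), key=lambda i: dot(query, database[i]), reverse=True)
    return order[:k]

# Hypothetical 2-D embeddings.
database = [
    [1.0, 0.0],   # index 0
    [0.6, 0.8],   # index 1: same direction as the query
    [3.0, 0.5],   # index 2: different direction, but large norm
    [0.0, 1.0],   # index 3
]
query = [0.6, 0.8]
result = mips_top_k(query, database, k=2)
```

Index 2 ranks first even though index 1 points in exactly the query's direction: inner product rewards vector norm, which is what distinguishes MIPS from cosine-similarity search.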

Memory3 Backbone (Inference)

Integrates retrieved memories into self-attention to generate tokens

Model or implementation: 2.4B Parameter Transformer
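
One way to picture "integrating retrieved memories into self-attention" is to let a query attend over the retrieved key-values concatenated with the keys and values of the current context. The sketch below assumes exactly that, with a single head, a single query, no masking, and no positional handling; all tensors are hypothetical toy values and the paper's actual layer design may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Hypothetical toy tensors (dimension 2).
memory_keys = [[1.0, 0.0], [0.0, 1.0]]       # retrieved, precomputed offline
memory_values = [[10.0, 0.0], [0.0, 10.0]]
context_keys = [[0.5, 0.5]]                   # computed from the current input
context_values = [[1.0, 1.0]]

query = [2.0, 0.0]
output, weights = attend(
    query,
    memory_keys + context_keys,       # memories are simply prepended
    memory_values + context_values,
)
```

Because the memory keys and values are already computed, the model skips the forward pass that a text prefix would require in standard RAG.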

Novel Architectural Elements

Explicit Memory format: Storing knowledge as retrievable, sparse attention key-values rather than raw text or model weights
Memory Sparsification: A mechanism to prune less important KV pairs to make external storage tractable
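
A minimal sketch of the sparsification idea, assuming each key-value pair carries a scalar importance score and only the top fraction is kept. The scoring rule is an assumption here (this summary does not say how importance is measured), and `sparsify` is a hypothetical helper.

```python
def sparsify(kv_pairs, importance, keep_fraction=0.1):
    """Keep the highest-scoring fraction of key-value pairs, in original order."""
    n_keep = max(1, round(len(kv_pairs) * keep_fraction))
    ranked = sorted(range(len(kv_pairs)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:n_keep])            # restore positional order
    return [(i, kv_pairs[i]) for i in kept]

# Hypothetical data: 20 token positions, two of which matter most.
kv_pairs = [(f"k{i}", f"v{i}") for i in range(20)]
importance = [0.01] * 20
importance[3] = 0.40
importance[17] = 0.35

sparse = sparsify(kv_pairs, importance, keep_fraction=0.1)
kept_positions = [i for i, _ in sparse]
```

With `keep_fraction=0.1`, storage for this reference drops from 20 pairs to 2, which mirrors the roughly 90% sparsity rate listed under Key Hyperparameters.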

Modeling

Base Model: Memory3-2.4B (Custom architecture trained from scratch)

Training Method: Two-stage pretraining: (1) Auto-Encoding (AE) for memory formation, (2) Auto-Regressive (AR) for generation

Objective Functions:

Purpose: Train the model to utilize explicit memory by reconstructing masked text.

Formally: Masked Language Modeling (MLM) loss during AE stage

Purpose: Train the model to generate text using retrieved memories.

Formally: Standard causal language modeling loss (Next Token Prediction) during AR stage
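
For reference, the standard textbook forms of these two losses are written out below. The notation is an assumption made for illustration: $x$ is the token sequence of length $T$, $\tilde{x}$ is the masked input, $\mathcal{M}$ is the set of masked positions, $E$ denotes the retrieved explicit memories, and $\theta$ the model parameters. How memories enter each stage is not specified in this summary, so the conditioning on $E$ should be read as a sketch.

```latex
\mathcal{L}_{\text{AE}}(\theta) = -\sum_{t \in \mathcal{M}} \log p_\theta\left(x_t \mid \tilde{x}, E\right)
\qquad
\mathcal{L}_{\text{AR}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}, E\right)
```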

Adaptation: Pretraining from scratch

Trainable Parameters: 2.4 billion non-embedding parameters

Training Data:

Pretraining: 2.3T tokens from curated datasets (RedPajama, Dolma, etc.)
Finetuning/Memory: Specific domain datasets (e.g., medical corpora) encoded into memory

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
sequence_length: Not explicitly reported in the paper
sparsity_rate: Typically 90% (keeping top-10% values)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Retro: Memory3 retrieves pre-computed KV pairs (Explicit Memory) to avoid real-time encoding cost, whereas Retro encodes retrieved text on the fly
vs. RAG: Memory3 integrates knowledge at the attention layer level via KV pairs, offering higher throughput than processing long text prefixes
vs. Memorizing Transformer: Memory3 handles massive external knowledge bases via sparsification, whereas Memorizing Transformer is limited to recent context history
vs. MoE: Memory3 externalizes knowledge completely, whereas MoE still stores knowledge in internal parameters (experts)

Limitations

Storage overhead for explicit memory can be large (reduced by sparsification but still significant compared to parameters alone)
Requires a two-stage pretraining process (AE then AR) which is more complex than standard causal training
Retrieval latency (MIPS) adds a constant overhead to inference, though lower than RAG encoding costs

Reproducibility

Code: https://github.com/Y-H-K/Memory3

The paper describes the architecture and training stages, but specific hyperparameters such as learning rate schedules and exact batch sizes are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Comparison against standard LLMs and RAG baselines on general and domain-specific tasks

Benchmarks:

C-Eval (Chinese General Knowledge Evaluation)
MMLU (Multi-task Language Understanding)
CMMLU (Chinese Multi-task Language Understanding)

Metrics:

Accuracy (0-100)
Perplexity
Decoding Speed (tokens/second)
Statistical methodology: Not explicitly reported in the paper

Key Results

Memory3-2.4B is compared against larger baselines (7B-13B parameters) on standard benchmarks to demonstrate parameter efficiency.
Benchmark	Metric	Baseline	This Paper	Δ
C-Eval	Accuracy	54.0	56.0	+2.0
MMLU	Accuracy	54.1	45.0	-9.1
CMMLU	Accuracy	57.0	56.7	-0.3
Decoding Speed	Speed (relative to RAG baseline)	1.0	1.66	+0.66
Medical QA (CMemb)	Accuracy (Unrecalled)	32.0	70.8	+38.8
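
The Δ column is simply "This Paper" minus "Baseline". The short check below recomputes it from the figures in the table; the row labels are shortened for the example.

```python
# (label, baseline, this paper), copied from the table above.
rows = [
    ("C-Eval", 54.0, 56.0),
    ("MMLU", 54.1, 45.0),
    ("CMMLU", 57.0, 56.7),
    ("Speed (relative)", 1.0, 1.66),
    ("Medical QA (unrecalled)", 32.0, 70.8),
]

# Round to two decimals to hide floating-point noise.
deltas = {name: round(ours - base, 2) for name, base, ours in rows}

# Rows where the 2.4B model is ahead of its baseline.
wins = [name for name, delta in deltas.items() if delta > 0]
```

The recomputed values match the table: the model leads on C-Eval, decoding speed, and medical QA, is roughly level on CMMLU, and trails on MMLU.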

Experiment Figures

A cost analysis comparing RAG, Explicit Memory, and Parametric Memory based on knowledge usage frequency.
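
The trade-off behind such a cost analysis can be sketched with a simple linear model: total cost equals a one-time write cost plus a per-use read cost times the number of uses. The cost units below are hypothetical and chosen only so that the ordering follows the hierarchy described under Key Novelty (cheap to write but expensive to read for RAG, the reverse for parameters, explicit memory in between); they are not taken from the paper.

```python
# Hypothetical cost units, chosen only to illustrate the trade-off.
COSTS = {
    "rag": {"write": 1.0, "read": 10.0},
    "explicit memory": {"write": 10.0, "read": 1.0},
    "parametric": {"write": 100.0, "read": 0.1},
}

def total_cost(form, uses):
    """One-time write cost plus read cost for every use of the knowledge."""
    c = COSTS[form]
    return c["write"] + uses * c["read"]

def cheapest(uses):
    """Memory form with the lowest total cost at a given usage count."""
    return min(COSTS, key=lambda form: total_cost(form, uses))

choices = {n: cheapest(n) for n in (0, 10, 1000)}
```

Under these assumed numbers, rarely used knowledge is cheapest as plain text, moderately used knowledge as explicit memory, and very frequently used knowledge as parameters.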

Main Takeaways

Explicit memory allows a 2.4B model to rival or exceed 7B models on knowledge-intensive benchmarks (C-Eval, CMMLU)
Memory3 offers a speed advantage over standard RAG because retrieved memories are pre-encoded KV pairs, bypassing the heavy context encoding step
The architecture significantly improves factuality in specialized domains (medical) by relying on explicit retrieval rather than parametric hallucination
Sparsification makes the external memory storage feasible without severe performance degradation

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, Key-Value caching)
Retrieval-Augmented Generation (RAG) concepts
Basic understanding of sparse attention mechanisms

Key Terms

Implicit Memory: Knowledge stored within the neural network's trainable parameters (weights)

Working Memory: The transient state stored in the context key-value cache during the processing of the current input sequence

Explicit Memory: The proposed format: sparse attention key-value pairs derived from text and stored externally, retrieved during inference

Knowledge Traversal: The inefficiency where an LLM activates all its parameters (and thus all stored knowledge) just to generate a single token

Top-k: A selection algorithm that keeps only the 'k' elements with the highest scores

MIPS: Maximum Inner Product Search—a technique to find vectors in a database that have the highest dot product with a query vector

Faiss: A library for efficient similarity search and clustering of dense vectors

RAG: Retrieval-Augmented Generation—enhancing models by retrieving relevant text chunks before generation

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance

KV Cache: Key-Value Cache—storing previous calculations in Transformers to speed up sequential generation
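
As a rough illustration of the KV cache idea, the toy class below stores one key and one value per processed token so that they are computed only once. `KVCache` and `toy_project` are hypothetical names, and the string "projections" stand in for the real key/value tensors of a Transformer layer.

```python
class KVCache:
    """Toy cache: stores one key and one value per processed token."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.projections_computed = 0

    def append(self, token, project):
        """Compute the key/value for a new token once and keep them."""
        key, value = project(token)
        self.projections_computed += 1
        self.keys.append(key)
        self.values.append(value)

def toy_project(token):
    # Hypothetical stand-in for the key/value projections of a real layer.
    return ("k:" + token, "v:" + token)

cache = KVCache()
for token in ["explicit", "memory", "is", "sparse"]:
    cache.append(token, toy_project)   # earlier tokens are never re-projected
```

Explicit memory, as defined above, can be read as the same kind of key-value store, except that it is computed offline from reference text, sparsified, and retrieved at inference time.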