Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

📝 Paper Summary

Memory internalization, Domain adaptation

Memory Decoder is a small, standalone transformer decoder trained to imitate the output distribution of an external kNN retriever. Its predictions are interpolated with those of any base LLM that shares its tokenizer, adapting the LLM to a domain without fine-tuning the base model.

Core Problem

Adapting LLMs to specialized domains currently requires a trade-off: either expensive full-parameter retraining (DAPT) that risks forgetting, or high-latency retrieval (RAG) that searches massive external datastores at inference time.

Why it matters:

Full retraining (DAPT) is computationally prohibitive for large models and must be repeated for every new model architecture, wasting resources.
Retrieval-Augmented Generation (RAG) adds significant inference latency due to nearest-neighbor searches and long context processing.
Existing methods lack portability; domain knowledge learned by one model cannot be easily transferred to another without retraining.

Concrete Example: When adapting a general LLM to the biomedical domain, DAPT requires retraining billions of parameters, while RAG must search a massive clinical database for every generated token. Memory Decoder avoids both by internalizing the database in the weights of a small plug-in model.

Key Novelty

Distilling Non-Parametric Retrieval into a Parametric Decoder

Trains a small decoder-only model to predict the output probability distribution of a k-nearest neighbor (kNN) retriever, effectively compressing a large external datastore into model weights.
Enables 'plug-and-play' adaptation: the trained Memory Decoder runs in parallel with *any* frozen LLM (sharing the same tokenizer) and their outputs are linearly interpolated, instantly adapting the LLM to the domain.

Architecture

The two-stage process of Memory Decoder: (1) a pre-training phase in which the decoder learns to mimic kNN distributions computed from a datastore, and (2) an inference phase in which the trained decoder runs in parallel with a frozen pretrained language model (PLM).

Evaluation Highlights

Reduces perplexity by an average of 6.17 points across biomedical, financial, and legal domains compared to base models.
A single 0.5B parameter Memory Decoder successfully adapts the entire Qwen2.5 family (ranging from 0.5B to 72B parameters) to the finance domain.
Runs at 1.28x the base model's inference latency, lower than both In-Context RAG (1.51x) and kNN-LM (2.17x).

Breakthrough Assessment

8/10

Offers a new angle on domain adaptation: portability. A memory module trained once can be plugged into models ranging from 0.5B to 72B parameters without retraining the base model, which is a significant efficiency gain.

⚙️ Technical Details

Problem Definition

Setting: Domain adaptation of a pretrained language model without modifying its parameters.

Inputs: Context sequence x = (x_1, ..., x_{t-1})

Outputs: Next-token prediction distribution optimized for the target domain.

Pipeline Flow

Input Processing (Tokenization)
Dual Parallel Decoding (Base LLM + Memory Decoder)
Distribution Interpolation

System Modules

Base LLM (Dual Parallel Decoding)

Provides general language modeling capabilities.

Model or implementation: Any pretrained LLM (e.g., Qwen2.5, Llama-3) that shares the Memory Decoder's tokenizer.

Memory Decoder (Dual Parallel Decoding)

Provides domain-specific probability distribution by mimicking a retriever.

Model or implementation: Small Transformer Decoder (e.g., 0.5B parameters)

Interpolator

Combines base and memory distributions.

Model or implementation: Linear interpolation
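The interpolation step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy logits, and the convention that alpha weights the Memory Decoder (with 1 - alpha on the base LLM) are assumptions made for the example.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def interpolate(p_base: np.ndarray, p_mem: np.ndarray, alpha: float) -> np.ndarray:
    """Mix two next-token distributions defined over the same vocabulary.

    Assumption: alpha weights the Memory Decoder, (1 - alpha) the base LLM.
    """
    if p_base.shape != p_mem.shape:
        raise ValueError("both models must share one vocabulary (same tokenizer)")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_mem + (1.0 - alpha) * p_base

# Toy 4-token vocabulary; the logits are invented for illustration.
base_logits = np.array([2.0, 1.0, 0.5, 0.1])  # stand-in for the frozen base LLM
mem_logits = np.array([0.1, 0.2, 3.0, 0.3])   # stand-in for the Memory Decoder
p_base = softmax(base_logits)
p_mem = softmax(mem_logits)
p_mix = interpolate(p_base, p_mem, alpha=0.5)
```

Because both inputs are valid distributions and the weights sum to one, the mixture is again a valid distribution. The shape check reflects the shared-tokenizer requirement: the two probability vectors must index the same vocabulary.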

Novel Architectural Elements

Parallel decoding architecture where a small 'satellite' decoder runs alongside a large frozen LLM specifically to inject domain probabilities.
Decoupled memory training: The memory module is trained independently of the base model's specific weights, relying only on the shared vocabulary/tokenizer.

Modeling

Base Model: Varies (GPT-2, Qwen2.5 family, Llama-3 family)

Training Method: Supervised learning on pre-computed kNN distributions

Objective Functions:

Purpose: Minimize the difference between the Memory Decoder's output and the non-parametric kNN retriever's distribution.
Formally: Distribution Alignment Loss (KL divergence), L_distill = KL(p_kNN || p_Mem).

Purpose: Maintain linguistic coherence by predicting the next token in the corpus.
Formally: Standard Cross-Entropy Loss, L_CE.

Purpose: Combine objectives.
Formally: L = beta * L_distill + (1 - beta) * L_CE.
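For a single token position, the combined objective can be written out as below. This is a hedged sketch under stated assumptions: the helper names and the toy distributions are hypothetical, and a real training loop would compute the same quantities over batches in a deep learning framework rather than in NumPy.

```python
import numpy as np

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q); terms where p is zero contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / np.maximum(q[support], eps))))

def cross_entropy(q, target_id: int, eps: float = 1e-12) -> float:
    """Negative log-probability the model assigns to the true next token."""
    return float(-np.log(max(float(q[target_id]), eps)))

def memory_decoder_loss(p_knn, p_mem, target_id: int, beta: float = 0.5) -> float:
    """L = beta * KL(p_kNN || p_Mem) + (1 - beta) * L_CE for one position."""
    l_distill = kl_divergence(p_knn, p_mem)
    l_ce = cross_entropy(p_mem, target_id)
    return beta * l_distill + (1.0 - beta) * l_ce

# Toy distributions over a 4-token vocabulary (invented values).
p_knn = np.array([0.7, 0.2, 0.1, 0.0])  # retriever target
p_mem = np.array([0.4, 0.3, 0.2, 0.1])  # Memory Decoder prediction
loss = memory_decoder_loss(p_knn, p_mem, target_id=0)
```

With beta = 1 the loss reduces to pure distillation, and with beta = 0 to ordinary next-token training. The direction KL(p_kNN || p_Mem) penalizes the decoder most where the retriever places mass that the decoder misses.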

Training Data:

Datastore construction: (Key, Value) pairs from domain corpus using a specific PLM layer.
Target generation: For each context x_i, perform kNN search (excluding top-1 self-match) to compute p_kNN.
Domains: Biomedicine (MIMIC-III), Finance (Financial news), Law (Asylex).
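The target-generation step can be illustrated with a brute-force search over a toy datastore. This sketch follows the usual kNN-LM recipe (softmax over negative distances, aggregated per token). The squared L2 distance, the temperature, the value of k, and the 2-dimensional keys are all assumptions made for illustration; in the paper, keys come from a specific PLM layer and the search runs over a far larger datastore.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=3,
                     temperature=1.0, skip_self=True):
    """Build a next-token distribution from the k nearest datastore entries.

    keys:   (N, d) array of context representations
    values: (N,) integer array with the token that followed each context
    skip_self: drop the top-1 hit, which is the query's own datastore entry
               when the query context was itself used to build the datastore.
    """
    dists = np.sum((keys - query) ** 2, axis=1)  # squared L2 (assumed metric)
    order = np.argsort(dists, kind="stable")
    if skip_self:
        order = order[1:]
    nearest = order[:k]
    # Subtracting the minimum distance only improves numerical stability.
    weights = np.exp(-(dists[nearest] - dists[nearest].min()) / temperature)
    probs = np.zeros(vocab_size)
    np.add.at(probs, values[nearest], weights)  # sum weights per token id
    return probs / probs.sum()

# Toy datastore: five contexts in 2-D, each followed by one token id.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
values = np.array([5, 1, 1, 2, 3])
query = keys[0]  # this context is part of the datastore
p_knn = knn_distribution(query, keys, values, vocab_size=6)
```

Without the self-match exclusion, the query's own entry (at distance zero) would dominate the distribution and the target would largely restate the ground-truth token instead of reflecting similar contexts.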

Key Hyperparameters:

beta: 0.5 (balance between KL and CE loss)
learning_rate: 1e-4 (Qwen experiments), 1e-3 (GPT-2 experiments)
k: Not stated explicitly in the text for kNN construction; presumably the standard kNN-LM setting
alpha: Tuned on validation split (interpolation weight)
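One plausible way to tune alpha on a validation split is a grid search that minimizes perplexity. The sketch below is an assumption about the procedure, not a description of the authors' code. It relies on the fact that, under linear interpolation, the mixed probability of the true next token depends only on the probabilities each model assigned to that token. The grid and the toy probabilities are invented, and alpha is assumed to weight the Memory Decoder.

```python
import numpy as np

def perplexity(token_probs: np.ndarray) -> float:
    """exp of the mean negative log-probability of the true tokens."""
    clipped = np.maximum(token_probs, 1e-12)  # guard against log(0)
    return float(np.exp(-np.mean(np.log(clipped))))

def tune_alpha(p_base_true: np.ndarray, p_mem_true: np.ndarray, grid=None):
    """Return (alpha, perplexity) minimizing validation perplexity.

    p_base_true[i] / p_mem_true[i]: probability each model gave to the
    true next token at validation position i.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)  # hypothetical search grid
    best_alpha, best_ppl = 0.0, float("inf")
    for alpha in grid:
        mixed = alpha * p_mem_true + (1.0 - alpha) * p_base_true
        ppl = perplexity(mixed)
        if ppl < best_ppl:
            best_alpha, best_ppl = float(alpha), ppl
    return best_alpha, best_ppl

# Toy validation set of four positions (invented probabilities).
p_base_true = np.array([0.20, 0.05, 0.60, 0.10])
p_mem_true = np.array([0.50, 0.40, 0.10, 0.30])
alpha, ppl = tune_alpha(p_base_true, p_mem_true)
```

Since the grid includes 0 and 1, the selected mixture is never worse on the validation split than either model alone.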

Compute: 8x A800 80GB GPUs. Training budget equivalent to training a 7B model for 1 epoch.

Comparison to Prior Work

vs. kNN-LM: Eliminates the need for massive datastore storage and slow retrieval during inference by distilling the distribution into weights.
vs. DAPT: Does not modify base model parameters and allows one memory module to serve multiple model sizes.
vs. RAG: Lower inference latency (1.28x vs 1.51x for In-Context RAG, relative to the base model) and does not consume context window length.
vs. LoRA: LoRA is model-specific; Memory Decoder is tokenizer-specific and transfers across model sizes (0.5B to 72B).

Limitations

Requires the base model and Memory Decoder to share the same tokenizer (though cross-vocabulary transfer is possible with re-initialization).
Performance depends on the quality of the underlying kNN distribution used for supervision.
Still adds some computational overhead (small decoder forward pass) compared to a bare model, though less than RAG.

Reproducibility

Code: https://github.com/LUMIA-Group/MemoryDecoder

Pretrained weights are available on HuggingFace. The paper provides a detailed experimental setup, including baselines and hyperparameters.

📊 Experiments & Results

Evaluation Setup

Language modeling (perplexity) and downstream NLP tasks across biomedical, financial, and legal domains.

Benchmarks:

WikiText-103 (General Language Modeling)
MIMIC-III (Biomedical Domain Modeling)
Financial News (Financial Domain Modeling)
Asylex (Legal Domain Modeling)
Various NLP Tasks (SST2, MR, CB, RTE, etc.) (Downstream Zero-shot Evaluation)

Metrics:

Perplexity (PPL)
Average Score (Accuracy/F1 based on task)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Language modeling results on WikiText-103 showing Memory Decoder effectiveness across GPT-2 scales.
WikiText-103	Perplexity	31.09	18.36	-12.73
WikiText-103	Perplexity	19.78	18.36	-1.42
Cross-model adaptation results demonstrating a single Memory Decoder (0.5B) improving the entire Qwen2.5 family on Financial domain.
Finance Domain	Perplexity	11.75	6.87	-4.88
Finance Domain	Perplexity	5.62	5.35	-0.27
Downstream task performance (Zero-shot) comparing preservation of general capabilities.
Average (9 tasks)	Score	50.1	61.3	+11.2
Inference latency comparison (baseline: kNN-LM; values are multiples of base-model latency).
Inference Speed	Latency Overhead (relative to base)	2.17	1.28	-0.89

Experiment Figures

Perplexity scores of Qwen2.5 models (0.5B to 72B) on Finance domain, with and without Memory Decoder.

Inference latency comparison between Base model, Memory Decoder, In-Context RAG, and kNN-LM.

Main Takeaways

A single Memory Decoder can enhance an entire family of models (e.g., Qwen2.5 from 0.5B to 72B) without retraining, demonstrating plug-and-play capability.
The method successfully mitigates catastrophic forgetting observed in DAPT, maintaining or improving performance on general downstream tasks.
Distilling non-parametric kNN distributions into a parametric model captures domain knowledge effectively while removing the heavy storage/compute cost of retrieval.
Cross-vocabulary transfer is possible: a decoder trained for Qwen can be adapted to Llama with only 10% of the training budget by re-initializing its embeddings.

📚 Prerequisite Knowledge

Prerequisites

Language Modeling (next-token prediction)
k-Nearest Neighbor Language Models (kNN-LM)
KL Divergence
Transformer architecture

Key Terms

DAPT: Domain Adaptive Pre-Training—continuing to train a model on domain-specific data to learn its patterns.

RAG: Retrieval-Augmented Generation—fetching relevant documents from an external database to help a model answer questions.

kNN-LM: k-Nearest Neighbor Language Model—a method that interpolates a model's prediction with tokens retrieved from a datastore of similar contexts.

Non-parametric retriever: A retrieval system (like kNN) that searches explicitly stored data rather than using learned weights.

KL divergence: Kullback-Leibler divergence—a statistical measure quantifying how much one probability distribution differs from another.

Plug-and-play: The ability to add a component to a system (like a memory module to an LLM) without requiring retraining or complex integration.

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better prediction.