kNN-LM: Generalization through memorization: Nearest Neighbor Language models

📝 Paper Summary


kNN-LM augments a pre-trained language model by linearly interpolating its next-token distribution with a nearest-neighbor distribution computed from a datastore of training examples, improving performance without retraining.

Core Problem

Neural language models struggle to predict rare patterns and factual knowledge (the long tail) because they must implicitly memorize training examples in their parameters.

Why it matters:

Implicit memorization in parameters is inefficient for rare events and factual knowledge, leading to poor generalization on the long tail.
Scaling models often requires massive retraining; current methods lack efficient ways to scale or adapt to new domains without further training.

Concrete Example: A model might learn that 'Dickens wrote...' should be followed by a book title, yet fail to predict 'David Copperfield' specifically. In the paper, a standard LM assigns probability 0.124 to the target 'honour' after a Gallipoli context, while the kNN distribution assigns 0.998 by retrieving a near-identical training example.

Key Novelty

k-Nearest Neighbor Language Model (kNN-LM)

Constructs a key-value datastore from training data, where keys are context embeddings and values are target tokens.
During inference, retrieves the nearest neighbors of the test context from the datastore and computes a distribution over their values.
Interpolates this retrieval-based distribution with the standard model's output distribution, allowing explicit memory access without retraining parameters.
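The three steps above can be written compactly as two equations. This is a sketch in the notation of this summary (context c_t, next token w_t); the symbols f (the context encoder), \mathcal{N} (the retrieved neighbor set), d (the distance function), and (k_i, v_i) (a stored key-value pair) are labels introduced here for illustration.

```latex
\begin{align}
p_{\mathrm{kNN}}(w_t \mid c_t) &\propto \sum_{(k_i, v_i) \in \mathcal{N}} \mathbb{1}[w_t = v_i]\, \exp\bigl(-d(k_i, f(c_t))\bigr) \\
p(w_t \mid c_t) &= \lambda\, p_{\mathrm{kNN}}(w_t \mid c_t) + (1 - \lambda)\, p_{\mathrm{LM}}(w_t \mid c_t)
\end{align}
```

The first line turns neighbor distances into a distribution over the neighbors' target tokens; the second mixes it with the LM's own distribution using the interpolation parameter lambda.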

Architecture

Illustration of the kNN-LM inference process using a pre-trained LM and a datastore.

Evaluation Highlights

Achieves a state-of-the-art perplexity of 15.79 on WikiText-103 when combined with a continuous cache, a 2.86-point improvement over the base model (18.65); kNN-LM alone reaches 16.12.
Outperforms a model trained on 3B tokens (15.17 perplexity) by training on just 100M tokens and retrieving from the 3B corpus (13.73 perplexity).
Effective domain adaptation: Adding a 'Books' datastore to a 'Wiki' model improves perplexity on Books from 34.84 to 20.47 without retraining.

Breakthrough Assessment

9/10

Simple yet highly effective method that achieved SOTA without additional training, demonstrated that retrieval can substitute for massive training data, and influenced subsequent retrieval-augmented generation research.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling

Inputs: Context sequence of tokens c_t = (w_1, ..., w_{t-1})

Outputs: Probability distribution p(w_t | c_t) over the next token w_t

Pipeline Flow

Pre-trained LM encodes Context -> Context Representation
kNN Search retrieves nearest neighbors from Datastore using Representation
Compute kNN probability distribution from neighbors
Interpolate kNN distribution with LM output distribution
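The flow above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it uses brute-force search instead of FAISS, and the 2-dimensional keys, the toy LM distribution, and the function names are all hypothetical.

```python
import math
from collections import defaultdict

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_distribution(query, keys, values, k):
    """Softmax over negative distances of the k nearest keys, summed per target token."""
    scored = [(squared_l2(query, key), value) for key, value in zip(keys, values)]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    d_min = nearest[0][0]  # subtract the smallest distance for numerical stability
    weights = [(math.exp(-(d - d_min)), value) for d, value in nearest]
    total = sum(w for w, _ in weights)
    probs = defaultdict(float)
    for w, value in weights:
        probs[value] += w / total
    return dict(probs)

def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}

# Toy datastore (assumed values): two stored contexts were followed by "honour".
keys = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
values = ["honour", "honour", "memory"]

query = (0.0, 0.1)  # stand-in for the encoded test context
p_lm = {"honour": 0.2, "memory": 0.5, "the": 0.3}  # stand-in for the LM output

p_knn = knn_distribution(query, keys, values, k=2)
p = interpolate(p_lm, p_knn, lam=0.25)
print(p_knn)
print(p)
```

With both retrieved neighbors pointing at the same token, the kNN distribution is concentrated on it, and the interpolation shifts probability mass toward that token while leaving the LM's other predictions in play.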

System Modules

Context Encoder

Map input context to a fixed-length vector representation

Model or implementation: Pre-trained Transformer Decoder (16 layers, 1024 dim)

Datastore (Retrieval)

Store all training examples as (Key, Value) pairs

Model or implementation: FAISS Index
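As an illustration of what "store all training examples as (Key, Value) pairs" means at the token level, the sketch below makes one entry per training position. The `toy_encode` function is a hypothetical stand-in for the pre-trained LM's context representation, and plain lists stand in for the FAISS index.

```python
def build_datastore(tokens, encode):
    """One (key, value) pair per position: key encodes the prefix, value is the next token."""
    keys, values = [], []
    for t in range(1, len(tokens)):
        keys.append(encode(tokens[:t]))  # representation of the context w_1..w_{t-1}
        values.append(tokens[t])         # the token that actually followed that context
    return keys, values

def toy_encode(context):
    """Hypothetical encoder: a 2-d vector built from the last token only."""
    last = context[-1]
    return (float(len(last)), float(ord(last[0])))

tokens = "the cat sat on the mat".split()
keys, values = build_datastore(tokens, toy_encode)
print(len(keys), values)
```

A single forward pass over the training text is enough to populate the datastore, which is why no parameter updates are involved.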

Retriever (Retrieval)

Find k nearest neighbors of test context in datastore

Model or implementation: FAISS (L2 distance)

Interpolator

Combine LM and kNN distributions

Model or implementation: Linear Interpolation

Novel Architectural Elements

Inference-only augmentation of pre-trained LMs with a massive external datastore of token-level examples
Substitution of implicit parameter memorization with explicit nearest-neighbor retrieval over training data

Modeling

Base Model: Transformer Decoder (Baevski & Auli, 2019 architecture)

Training Data:

WikiText-103
Toronto Books Corpus (0.7B tokens)
Wiki-3B (2.87B tokens)
Wiki-100M (subset)

Key Hyperparameters:

layers: 16
hidden_dim: 1024
ffn_dim: 4096
heads: 16
k_neighbors: 1024
interpolation_lambda: 0.25 (WikiText-103)
context_window: 3072 tokens (WikiText-103)
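An interpolation weight such as the one listed above is usually chosen by sweeping candidate values on held-out data and keeping the one with the lowest perplexity. The sketch below shows that procedure under assumed inputs: the gold-token probabilities are invented for illustration, and the grid spacing is arbitrary.

```python
import math

def interpolated_perplexity(p_lm_gold, p_knn_gold, lam):
    """Perplexity of lam * p_knn + (1 - lam) * p_lm evaluated on the gold tokens."""
    log_probs = [math.log(lam * k + (1 - lam) * l) for l, k in zip(p_lm_gold, p_knn_gold)]
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical probabilities each component assigns to the correct next token.
p_lm_gold = [0.30, 0.05, 0.60, 0.10]
p_knn_gold = [0.10, 0.90, 0.40, 0.00]

# lam = 1.0 is excluded: a token with zero kNN probability would give log(0).
grid = [i / 20 for i in range(20)]
best_lam = min(grid, key=lambda lam: interpolated_perplexity(p_lm_gold, p_knn_gold, lam))
print(best_lam, interpolated_perplexity(p_lm_gold, p_knn_gold, best_lam))
```

Because lambda is a single scalar, this search needs no gradient steps and can be rerun whenever the datastore changes.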

Compute: Building 103M datastore takes ~2 hours on CPU. Validation with k=1024 takes ~25 mins.

Comparison to Prior Work

vs. Continuous Cache: Retrieves from the entire static training set (explicit memory) rather than just the dynamic local context [cited in paper]
vs. Dynamic Evaluation (Krause et al., 2019): Does not update model parameters at test time, only retrieves examples [cited in paper]
vs. BERT [not cited in paper]: Autoregressive generation vs Masked LM; kNN-LM focuses on generation/perplexity

Limitations

Inference is slower because every prediction requires a nearest-neighbor search over a large datastore; without an optimized index, search cost grows linearly with datastore size.
Storage requirements are high (keys and values for every token in the corpus).
Performance depends on the quality and domain relevance of the datastore.
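A back-of-the-envelope calculation makes the storage and search costs concrete. The entry count (103M) and key dimension (1024) come from the figures quoted in this summary; storing keys as uncompressed float32 and values as 4-byte token ids are assumptions made here for illustration, not settings reported above.

```python
def datastore_bytes(num_entries, key_dim, bytes_per_dim, bytes_per_value):
    """Raw size of a datastore holding one key vector and one value per entry."""
    return num_entries * (key_dim * bytes_per_dim + bytes_per_value)

NUM_ENTRIES = 103_000_000  # one entry per training token (WikiText-103 scale)
KEY_DIM = 1024             # context representation size

# Assumed: float32 keys (4 bytes per dimension) and 4-byte token ids.
raw_bytes = datastore_bytes(NUM_ENTRIES, KEY_DIM, bytes_per_dim=4, bytes_per_value=4)

# Exhaustive search touches every key dimension once per query.
multiply_adds_per_query = NUM_ENTRIES * KEY_DIM

print(f"uncompressed datastore: {raw_bytes / 1e9:.1f} GB")
print(f"brute-force multiply-adds per query: {multiply_adds_per_query:.3e}")
```

Numbers of this magnitude are why practical setups rely on compressed, approximate indexes rather than exhaustive search over raw vectors.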

Reproducibility

Code: https://github.com/urvashik/knnlm

Code is publicly available at the repository above. Datasets are standard (WikiText-103, Toronto Books). Retrieval uses the FAISS library. Base model weights are from Baevski & Auli (2019).

📊 Experiments & Results

Evaluation Setup

Language Modeling (Next Token Prediction)

Benchmarks:

WikiText-103 (Language Modeling)
Toronto Books Corpus (Language Modeling, Domain Adaptation)
Wiki-3B (Language Modeling, Scaling)

Metrics:

Perplexity
Statistical methodology: Report median of three random seeds for WikiText-103
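For reference, perplexity is the exponentiated average negative log-likelihood of the gold tokens. The sketch below computes it from a list of per-token probabilities; the function name and the example values are illustrative.

```python
import math

def perplexity(gold_token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in gold_token_probs) / len(gold_token_probs)
    return math.exp(nll)

# A model that always gives the correct token probability 1/4 is as uncertain
# as a uniform choice among 4 options.
uniform_case = perplexity([0.25, 0.25, 0.25, 0.25])

# Raising the probability of a single gold token lowers perplexity.
improved_case = perplexity([0.25, 0.25, 0.25, 0.90])
print(uniform_case, improved_case)
```

This is why large gains on a few long-tail tokens, such as the retrieval example earlier, can move the corpus-level number noticeably.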

Key Results

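Benchmark	Setting	Metric	Baseline	This Paper	Δ
WikiText-103	kNN-LM vs. base LM	Perplexity	18.65	16.12	-2.53
WikiText-103	kNN-LM + continuous cache vs. base LM	Perplexity	18.65	15.79	-2.86
Books	kNN-LM vs. base LM (in-domain)	Perplexity	11.89	10.89	-1.00
Wiki-3B Test	Wiki-100M LM + Wiki-3B datastore vs. Wiki-100M LM	Perplexity	19.59	13.73	-5.86
Books	Wiki-trained LM + Books datastore vs. Wiki-trained LM (domain adaptation)	Perplexity	34.84	20.47	-14.37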

Experiment Figures

Impact of datastore size on perplexity and the optimal interpolation parameter (lambda).

Training curves for Transformer LM with and without dropout.

Main Takeaways

Retrieving neighbors from a large corpus can outperform training on that corpus (100M train + 3B datastore > 3B train).
kNN-LM is particularly effective for rare patterns and factual knowledge (long tail) where implicit memorization fails.
The approach is additive with other techniques like Continuous Cache.
Using the input to the final layer's feed-forward network, taken after layer normalization, as the key representation works best.
Performance improves monotonically with the number of neighbors (k) retrieved (up to 1024 tested).

📚 Prerequisite Knowledge

Prerequisites

Autoregressive Language Modeling
Transformer architecture (Self-Attention, FFN)
Nearest Neighbor Search (kNN)
Perplexity

Key Terms

kNN-LM: k-Nearest Neighbors Language Model—the proposed approach augmenting LMs with retrieval

datastore: A key-value storage containing context embeddings (keys) and target tokens (values) from a text collection

FAISS: A library for efficient similarity search and clustering of dense vectors

perplexity: A measurement of how well a probability model predicts a sample; lower is better (exponentiated negative log-likelihood)

RBF kernel: Radial Basis Function kernel—a similarity function that decreases with distance, used here to convert distances to probabilities

interpolation parameter (lambda): A scalar weight controlling the mix between the standard LM probability and the kNN probability

BPE: Byte-Pair Encoding—a subword tokenization method

Transformer-XL: A Transformer architecture variant optimized for long contexts

continuous cache: A mechanism (Grave et al., 2017c) that stores recent hidden states from the current document to aid in local context copying