Efficient Nearest Neighbor Language Models

📝 Paper Summary

Modularized RAG pipeline, Efficient Retrieval

This paper proposes three strategies—adaptive retrieval, datastore pruning, and dimension reduction—to speed up k-nearest neighbor language models (kNN-LM) by up to 6x while maintaining performance.

Core Problem

Non-parametric models like kNN-LM require retrieving from massive datastores (billions of tokens) at every generation step, causing significant inference latency and storage overhead.

Why it matters:

The high computational cost of dense retrieval at every step prevents the deployment of effective non-parametric models in real-world applications
Storing full-dimension vectors for every token in a large corpus creates unmanageable memory footprints (e.g., hundreds of GBs)
Retrieval is often unnecessary for easy predictions, yet standard kNN-LM forces costly retrieval operations regardless of difficulty

Concrete Example: In standard kNN-LM, generating a common word like 'the' triggers a search through a 103-million-entry datastore (WikiText-103), taking ~200ms, whereas the base neural model could predict it instantly without retrieval. These per-token searches add up to prohibitively slow generation (e.g., ~50 tokens/sec vs. ~1800 tokens/sec for a vanilla LM).

Key Novelty

Efficient kNN-LM via Datastore Pruning and Adaptive Retrieval

Adaptive Retrieval: A lightweight classifier decides *when* to retrieve based on the vanilla LM's confidence, skipping retrieval for easy tokens
Datastore Pruning: Redundant datastore entries are removed, either by clustering vectors and keeping only representative centroids or by keeping only the tokens where the kNN-LM provides high gain
Dimension Reduction: High-dimensional retrieval keys are projected into a lower-dimensional space using PCA to speed up distance calculations

Architecture

Overview of the Efficient kNN-LM pipeline incorporating pruning, dimension reduction, and adaptive retrieval.

Evaluation Highlights

Achieves up to 6x inference speedup compared to vanilla kNN-LM on WikiText-103 while maintaining comparable perplexity
Reduces datastore size by 95% (pruning) with negligible performance loss compared to the full datastore
Adaptive retrieval skips neighbor search for up to 80% of tokens without significant perplexity degradation

Breakthrough Assessment

7/10

Provides practical, effective solutions to the major bottleneck of kNN-LMs (speed/storage). While the techniques (PCA, pruning) are standard, their application and combination in this specific context enable the deployment of previously impractical models.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling with access to an external datastore of context-target pairs

Inputs: Context sequence of tokens c_t

Outputs: Next token probability distribution p(w | c_t)

Pipeline Flow

Input Context → Base NLM → [Adaptive Retrieval Check]
If Retrieve: Context Vector → [Dimension Reduction] → [Pruned Datastore Search] → kNN Distribution → Interpolate with Base NLM Distribution → Output
If Skip: Base NLM Distribution → Output
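
The two branches of the flow can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the toy distributions, and the threshold gate `low_confidence` are all hypothetical (the paper uses a trained classifier rather than a fixed threshold), and `lam` plays the role of lambda_interpolation.

```python
from typing import Callable, Dict

Dist = Dict[str, float]  # token -> probability


def interpolate(p_nlm: Dist, p_knn: Dist, lam: float) -> Dist:
    """Return lam * p_kNN + (1 - lam) * p_NLM over the union of both supports."""
    vocab = set(p_nlm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1.0 - lam) * p_nlm.get(w, 0.0) for w in vocab}


def next_token_distribution(
    p_nlm: Dist,
    should_retrieve: Callable[[Dist], bool],
    retrieve_knn: Callable[[], Dist],
    lam: float = 0.25,
) -> Dist:
    """Skip branch returns the base distribution; retrieve branch interpolates."""
    if not should_retrieve(p_nlm):
        return dict(p_nlm)  # no datastore search at all
    return interpolate(p_nlm, retrieve_knn(), lam)


# Hypothetical gate (an assumption): retrieve only when the base model's
# top probability is low. A trained classifier would replace this.
def low_confidence(p: Dist) -> bool:
    return max(p.values()) < 0.5


# Confident prediction: retrieval is skipped and the base distribution is returned.
easy = next_token_distribution({"the": 0.9, "a": 0.1}, low_confidence, lambda: {"the": 1.0})

# Uncertain prediction: the (toy) kNN distribution is mixed in.
hard = next_token_distribution(
    {"Paris": 0.3, "London": 0.3, "Rome": 0.4}, low_confidence, lambda: {"Paris": 1.0}
)
```

Passing `retrieve_knn` as a callable means the expensive search only runs on the retrieve branch, which is where the speedup comes from.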

System Modules

Base NLM

Compute context representation and initial token probabilities

Model or implementation: Transformer (fairseq implementation)

Adaptive Retriever (Retrieval & Selection)

Decide whether to perform retrieval for the current token

Model or implementation: Lightweight classifier (single linear layer) on top of frozen NLM

Dimensionality Reducer (Retrieval & Selection)

Compress high-dimensional keys for faster distance calculation

Model or implementation: PCA Matrix
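
A rough sketch of how such a PCA matrix could be fitted and applied, assuming NumPy and toy sizes (16 dimensions reduced to 4, rather than the key sizes used in the paper). The helper names `fit_pca` and `project` are illustrative and not taken from the released code.

```python
import numpy as np


def fit_pca(keys: np.ndarray, out_dim: int):
    """Fit PCA on datastore keys; return the mean and an (in_dim, out_dim) projection."""
    mean = keys.mean(axis=0)
    centered = keys - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dim].T


def project(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Apply the same centering and projection to keys or to a query vector."""
    return (x - mean) @ components


rng = np.random.default_rng(0)
keys = rng.normal(size=(200, 16))          # toy stand-in for datastore keys
mean, components = fit_pca(keys, out_dim=4)
reduced_keys = project(keys, mean, components)
reduced_query = project(keys[0], mean, components)
```

The projection is fitted once offline; at inference time each query needs only one extra matrix multiplication, and every distance computation then runs in the smaller space.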

Datastore Search (Retrieval & Selection)

Find nearest neighbors in the pruned datastore

Model or implementation: FAISS Index (Pruned)
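
To make the search step concrete, the sketch below replaces FAISS with an exact brute-force search in plain Python and then turns the retrieved neighbors into a kNN distribution, using the standard kNN-LM form (a softmax over negative distances, summed per target token). The squared-L2 distance, the toy keys, and the function name are assumptions for illustration; a real datastore would use an approximate index.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def knn_distribution(
    query: Sequence[float],
    keys: List[Sequence[float]],
    values: List[str],
    k: int,
    temperature: float = 1.0,
) -> Dict[str, float]:
    """Exact kNN over a non-empty datastore, aggregated into token probabilities."""
    # Squared L2 distance from the query to every key (brute force).
    dists = [sum((q - x) ** 2 for q, x in zip(query, key)) for key in keys]
    nearest = sorted(range(len(keys)), key=dists.__getitem__)[:k]

    # Softmax over negative distances; subtracting the minimum keeps exp() stable.
    d_min = dists[nearest[0]]
    weights = [math.exp(-(dists[i] - d_min) / temperature) for i in nearest]
    total = sum(weights)

    # Neighbors that share a target token pool their probability mass.
    probs: Dict[str, float] = defaultdict(float)
    for i, w in zip(nearest, weights):
        probs[values[i]] += w / total
    return dict(probs)


toy_keys = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
toy_values = ["Paris", "Paris", "London"]
p_top2 = knn_distribution((0.0, 0.0), toy_keys, toy_values, k=2)
p_all = knn_distribution((0.0, 0.0), toy_keys, toy_values, k=3)
```

With `k=2` only the two "Paris" entries are retrieved, so all of the mass lands on that token; with `k=3` the distant "London" entry receives a vanishingly small share.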

Novel Architectural Elements

Adaptive Retrieval mechanism that conditions retrieval on NLM confidence features (max prob, min top-k prob)
Integration of PCA-based dimensionality reduction specifically for kNN-LM retrieval keys
Two specific pruning strategies for kNN-LM datastores: clustering-based (k-means) and performance-based (keeping entries that correct NLM errors)
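
The clustering-based strategy might look roughly like the following sketch, which assumes NumPy. Two choices here are assumptions and not details from the paper: keys are clustered separately per target token so that every centroid has an unambiguous value, and k-means is initialized deterministically from the first points of each group. The function names are hypothetical.

```python
from collections import defaultdict

import numpy as np


def kmeans(points: np.ndarray, n_clusters: int, n_iters: int = 10) -> np.ndarray:
    """Plain Lloyd's k-means with a deterministic init (first n_clusters points)."""
    centroids = points[:n_clusters].copy()
    for _ in range(n_iters):
        # Squared distances from every point to every centroid.
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)
        for c in range(n_clusters):
            members = points[assign == c]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[c] = members.mean(axis=0)
    return centroids


def prune_by_clustering(keys: np.ndarray, values: list, keep_ratio: float):
    """Replace each token's keys with a smaller set of centroids."""
    groups = defaultdict(list)
    for i, v in enumerate(values):
        groups[v].append(i)

    new_keys, new_values = [], []
    for v, idx in groups.items():
        n = max(1, int(round(len(idx) * keep_ratio)))  # keep at least one entry per token
        new_keys.append(kmeans(keys[idx], n))
        new_values.extend([v] * n)
    return np.concatenate(new_keys), new_values


rng = np.random.default_rng(0)
toy_keys = rng.normal(size=(40, 8))
toy_values = ["a"] * 20 + ["b"] * 20
pruned_keys, pruned_values = prune_by_clustering(toy_keys, toy_values, keep_ratio=0.1)
```

In this toy run 40 entries shrink to 4, a 90% reduction; the pairwise-distance step is quadratic in memory and would need batching or an approximate method at datastore scale.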

Modeling

Base Model: Standard Transformer LM (24 layers, 1024 hidden size, 16 heads) for WikiText-103

Training Method: Adaptive retrieval classifier training & PCA matrix calculation

Objective Functions:

Purpose: Train adaptive classifier to identify tokens needing retrieval.

Formally: Binary classification minimizing cross-entropy where label y=1 if p_kNN(w) > p_NLM(w) + margin, else y=0.
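
As a sketch of this objective, the snippet below builds the label, extracts the two confidence features named in this summary (max prob and min top-k prob), and scores them with a single linear layer followed by a sigmoid. It reads w as the gold next token, which is an assumption, and the weights, bias, and feature vector are hypothetical values chosen only for illustration.

```python
import math
from typing import List, Sequence


def retrieval_label(p_knn_gold: float, p_nlm_gold: float, margin: float = 0.0) -> int:
    """y = 1 when p_kNN(w) exceeds p_NLM(w) by more than the margin, else 0."""
    return 1 if p_knn_gold > p_nlm_gold + margin else 0


def confidence_features(p_nlm: Sequence[float], top_k: int) -> List[float]:
    """Return [max probability, smallest probability among the top-k]."""
    top = sorted(p_nlm, reverse=True)[:top_k]
    return [top[0], top[-1]]


def predict_retrieve_prob(feats: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Single linear layer plus sigmoid: estimated probability that retrieval helps."""
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(pred: float, label: int, eps: float = 1e-12) -> float:
    """Cross-entropy loss for one token; eps guards against log(0)."""
    return -(label * math.log(pred + eps) + (1 - label) * math.log(1.0 - pred + eps))


feats = confidence_features([0.7, 0.2, 0.05, 0.05], top_k=2)
prob = predict_retrieve_prob(feats, weights=[-2.0, -1.0], bias=1.0)  # hypothetical parameters
loss = binary_cross_entropy(prob, retrieval_label(p_knn_gold=0.6, p_nlm_gold=0.2))
```

At inference time the predicted probability is compared against a threshold, and that threshold is the knob that trades speed against perplexity.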

Training Data:

WikiText-103 (103M tokens)
1B Word (training set subset for domain adaptation)

Key Hyperparameters:

k_neighbors: 1024
lambda_interpolation: 0.25 (WikiText-103)
temperature_knn: 10-100 (tuned on dev set)
pca_dimensions: 512 (typically)
pruning_rate: 95% (typical setting in experiments)
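
A back-of-the-envelope calculation shows what these settings mean for storage. It assumes uncompressed 16-bit keys, 1024-dimensional keys before PCA (matching the hidden size), and both pruning and PCA applied together; it ignores stored values, index overhead, and any quantization FAISS applies, so the numbers are illustrative only.

```python
def datastore_key_bytes(num_entries: int, dim: int, bytes_per_value: int = 2) -> int:
    """Raw size of the key matrix, assuming bytes_per_value bytes per dimension (fp16 = 2)."""
    return num_entries * dim * bytes_per_value


NUM_TOKENS = 103_000_000                     # WikiText-103 datastore entries
kept_entries = NUM_TOKENS * 5 // 100         # 95% pruning keeps 5% of entries

full_bytes = datastore_key_bytes(NUM_TOKENS, 1024)
small_bytes = datastore_key_bytes(kept_entries, 512)

print(f"full:   {full_bytes / 1e9:.1f} GB")
print(f"pruned: {small_bytes / 1e9:.1f} GB")
print(f"ratio:  {full_bytes // small_bytes}x")
```

Under these assumptions the raw keys shrink from roughly 211 GB to about 5.3 GB: a factor of 20 from pruning and a further factor of 2 from halving the dimension.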

Compute: Single NVIDIA GeForce GTX 1080 Ti for inference speed measurement; FAISS for retrieval

Comparison to Prior Work

vs. Vanilla kNN-LM: Introduces pruning, PCA, and adaptive retrieval to trade a small increase in perplexity for large speedups (up to 6x)
vs. Neural Cache: Uses a massive static offline datastore rather than just a local context cache
vs. RETRO [not cited in paper]: RETRO retrieves chunks during encoding/training; Efficient kNN-LM is a post-hoc inference optimization for autoregressive decoding

Limitations

Pruning requires a pre-processing step that can be computationally expensive (e.g., clustering 100M vectors)
Performance-based pruning depends on the alignment between the specific validation set used for pruning and the test set
Speedups are hardware dependent (CPU vs GPU retrieval bottlenecks)
Adaptive retrieval threshold needs tuning to balance speed and perplexity

Reproducibility

Code: https://github.com/jxhe/efficient-knnlm

The code is publicly available at the repository above. Datastore construction uses the standard FAISS library. Hyperparameters for all datasets (WikiText-103, 1B Word) are provided.

📊 Experiments & Results

Evaluation Setup

Language modeling perplexity and inference speed evaluation on standard benchmarks

Benchmarks:

WikiText-103 (Language Modeling)
1B Word (subsets) (Domain Adaptation: Law, Medical, IT, etc.)

Metrics:

Perplexity (PPL)
Inference Speed (tokens/sec)
Speedup (x times baseline)
Statistical methodology: Not explicitly reported in the paper
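
For reference, the first and third metrics can be computed as in the sketch below. Perplexity is the exponential of the average negative log-probability assigned to the gold tokens, and speedup is a simple throughput ratio; the function names and toy inputs are illustrative.

```python
import math
from typing import Sequence


def perplexity(gold_token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability of the gold tokens (lower is better)."""
    nll = -sum(math.log(p) for p in gold_token_probs) / len(gold_token_probs)
    return math.exp(nll)


def speedup(baseline_tokens_per_sec: float, new_tokens_per_sec: float) -> float:
    """How many times faster the new system is than the baseline."""
    return new_tokens_per_sec / baseline_tokens_per_sec


toy_ppl = perplexity([0.25, 0.25, 0.25, 0.25])  # a model that is always 1-in-4 sure
toy_speedup = speedup(50, 300)                   # 50 tokens/s baseline vs. 300 tokens/s
```

A model that assigns probability 0.25 to every gold token has perplexity 4, and going from 50 to 300 tokens/sec is a 6x speedup.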

Key Results

Main results demonstrating the trade-off between perplexity and speed using the proposed efficiency methods.
Benchmark	Metric	Baseline	This Paper	Δ
WikiText-103	Perplexity	16.12	16.30	+0.18
WikiText-103	Inference Speed (tokens/s)	50	300	+250
WikiText-103	Perplexity	16.12	16.15	+0.03
WikiText-103	Perplexity	16.12	16.27	+0.15
WikiText-103	Perplexity	16.12	16.14	+0.02

Experiment Figures

Detailed Pareto frontiers for each individual technique (Pruning, Adaptive, PCA) showing Perplexity vs. Speed/Datastore Size.

Main Takeaways

Datastore pruning is highly effective: removing up to 90-95% of the datastore results in negligible perplexity increases, suggesting immense redundancy in token representations.
Adaptive retrieval works because kNN retrieval mostly helps with 'hard' tokens (where the NLM is uncertain); skipping retrieval for confident predictions saves time without hurting quality.
Combining all three methods (Pruning, Adaptive Retrieval, PCA) yields the best Pareto frontier for speed vs. accuracy.
The methods generalize to domain adaptation settings (e.g., training on 1B Word, adapting to Law/Medical text), reducing the overhead of utilizing external domain data.

📚 Prerequisite Knowledge

Prerequisites

k-Nearest Neighbors Language Models (kNN-LM)
Dense Retrieval / Vector Search (FAISS)
Principal Component Analysis (PCA)
Language Model Perplexity

Key Terms

kNN-LM: A language model that interpolates probabilities from a neural network with probabilities derived from retrieving similar contexts from a training datastore

datastore: A key-value store where keys are vector representations of context and values are the subsequent target tokens

PCA: Principal Component Analysis—a technique to reduce the number of dimensions in data while retaining as much variation as possible

FAISS: A library for efficient similarity search and clustering of dense vectors

perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance

inference overhead: The extra time and computation required to generate text compared to a standard model, often due to retrieval steps

parametric LM: A standard neural language model where knowledge is stored entirely in the model weights

non-parametric LM: A model that references external data (examples) at test time, explicitly memorizing training points