kNN-LM: k-Nearest Neighbors Language Model, the proposed approach, which augments a language model with nearest-neighbor retrieval
datastore: A key-value store holding context embeddings (keys) and their target tokens (values), built from a text collection
FAISS: A library for efficient similarity search and clustering of dense vectors
perplexity: A measure of how well a probability model predicts a sample, computed as the exponentiated average negative log-likelihood per token; lower is better
RBF kernel: Radial Basis Function kernel—a similarity function that decreases with distance, used here to convert distances to probabilities
interpolation parameter (lambda): A scalar weight controlling the mix between the standard LM probability and the kNN probability
BPE: Byte-Pair Encoding—a subword tokenization method
Transformer-XL: A Transformer architecture variant optimized for long contexts
continuous cache: A mechanism (Grave et al., 2017c) that stores recent hidden states from the current document to aid in local context copying
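
How the RBF kernel, interpolation parameter, and perplexity entries above fit together can be illustrated with a minimal sketch. It assumes the retrieved neighbors arrive as (distance, target token) pairs and that distributions are plain dictionaries; the function names (knn_distribution, interpolate, perplexity) are hypothetical and chosen only for this illustration, not taken from any released implementation.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors):
    """Turn retrieved (distance, target_token) pairs into a distribution.

    Each neighbor gets an RBF-style weight exp(-distance); weights of
    neighbors that share a target token are summed, then normalized.
    """
    # Subtracting the smallest distance leaves the normalized result
    # unchanged but avoids underflow when all distances are large.
    d_min = min(d for d, _ in neighbors)
    weights = defaultdict(float)
    for distance, token in neighbors:
        weights[token] += math.exp(-(distance - d_min))
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        tok: lam * p_knn.get(tok, 0.0) + (1.0 - lam) * p_lm.get(tok, 0.0)
        for tok in vocab
    }


def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Toy example: two equally distant neighbors with different targets.
p_knn = knn_distribution([(0.0, "a"), (0.0, "b")])
p_lm = {"a": 0.8, "c": 0.2}
p_mixed = interpolate(p_lm, p_knn, lam=0.25)
```

With lam = 0 the mixture reduces to the base LM and with lam = 1 it relies on the retrieved neighbors alone; in practice the value would be tuned on held-out data.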