kNN-LM: A language model augmented by linearly interpolating its output distribution with a distribution computed from nearest neighbors in a datastore of training examples
over-specification: A phenomenon where training data contains redundant information that is not causally necessary for the prediction (e.g., unnecessary relative clauses), which degrades the model's predictions at inference time when that information is absent
softmax bottleneck: The theoretical limitation where the final linear layer followed by a softmax yields log-probabilities whose rank is limited by the hidden dimension, restricting the expressiveness of the probability distributions a model can generate
Macondo: A synthetic dataset created by the authors to test generalization, in which parent-child relationships are described with irrelevant attributes (e.g., birth year) in training but without them at test time
datastore: A key-value store where keys are vector representations of context from the training set and values are the subsequent target tokens
MLP augmentation: The authors' proposed method of training a Multi-Layer Perceptron to predict the next token from the intermediate representation, replacing the explicit kNN search
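
To make the kNN-LM and datastore entries concrete, here is a minimal sketch in plain Python. It is illustrative only: the toy two-dimensional keys, the brute-force search, the squared-L2 distance, the temperature, and the interpolation weight `lam` are all assumptions chosen for the example, not values or code from the authors.

```python
import math
from collections import defaultdict

def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens."""
    # Brute-force squared L2 distance from the query to every stored key
    # (an assumption; real systems typically use an approximate index).
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    # Softmax over negative distances; neighbors sharing a token pool their mass.
    weights = [math.exp(-dist / temperature) for dist, _ in scored]
    total = sum(weights)
    probs = defaultdict(float)
    for weight, (_, token) in zip(weights, scored):
        probs[token] += weight / total
    return dict(probs)

def interpolate(p_lm, p_knn, lam=0.25):
    """Linear interpolation: lam * p_kNN + (1 - lam) * p_LM."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

# Hypothetical datastore: keys are context vectors, values are the next tokens.
datastore = [((0.0, 1.0), "Paris"), ((0.1, 0.9), "Paris"), ((1.0, 0.0), "Rome")]
# Hypothetical base language model distribution for the current context.
p_lm = {"Paris": 0.5, "Rome": 0.4, "Berlin": 0.1}

p_knn = knn_distribution((0.0, 1.0), datastore, k=2)
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

In this toy case both retrieved neighbors store "Paris", so the interpolation shifts probability mass toward "Paris" relative to the base model while leaving the result a valid distribution.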