
On Retrieval Augmentation and the Limitations of Language Model Training

TR Chiang, XV Yu, J Robinson, O Liu, I Lee…
University of Southern California
arXiv, 11/2023 (2023)
RAG, Factuality, Benchmark

📝 Paper Summary

Language Model Generalization, Retrieval-Augmented Generation (RAG)
The performance gap between vanilla and kNN-augmented LMs is caused not by softmax bottlenecks but by the vanilla LM's inability to generalize from over-specified training data containing redundant information.
Core Problem
Vanilla language models fail to generalize when training data is over-specified, that is, when it contains redundant information that is not causally relevant to the prediction. kNN-augmented models handle such data robustly.
Why it matters:
  • Real-world training data often contains redundant details (e.g., 'I was drunk *when I left the party*'), which can mislead models about which parts of the context are causally relevant
  • This limitation persists even in large models like GPT-3.5 Turbo, suggesting scaling alone cannot solve the generalization failure caused by over-specification
  • Understanding this gap reveals why retrieval augmentation (kNN-LM) improves perplexity even when retrieving from the exact same training data used to train the model
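The kNN-LM mentioned above is usually formulated as an interpolation between the base LM's next-token distribution and a distribution built from the nearest neighbors in a datastore of (context representation, next token) pairs. The following is a minimal sketch of that idea, not the paper's implementation: the function names, the toy 2-d keys, the squared-L2 distance, and the interpolation weight `lam` are all illustrative assumptions.

```python
import math
from collections import defaultdict


def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Build a next-token distribution from the k nearest datastore entries.

    datastore is a list of (key_vector, next_token) pairs (hypothetical layout).
    """
    # Squared L2 distance from the query to every stored key.
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    ]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]

    # Closer neighbors get exponentially larger weight.
    weights = defaultdict(float)
    for dist, token in nearest:
        weights[token] += math.exp(-dist / temperature)

    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the LM and kNN distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }


# Toy usage: the base LM prefers the wrong token, the neighbors correct it.
datastore = [([0.0, 0.0], "Ana"), ([0.1, 0.0], "Ana"), ([5.0, 5.0], "Luis")]
p_lm = {"Ana": 0.2, "Luis": 0.8}
p_knn = knn_distribution([0.0, 0.1], datastore, k=2)
p_mix = interpolate(p_lm, p_knn, lam=0.5)
```

Because the datastore can be built from the same corpus the LM was trained on, any gain from the mixture reflects how the information is accessed rather than new data.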
Concrete Example: A model trained on '[villager], who was born in 1990, is the parent of [child]' fails to predict [child] when tested on the simpler prompt '[villager] is the parent of [child]', because it has learned to rely on the irrelevant birth-year information. A kNN-LM, by contrast, retrieves the correct continuation.
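To make the train/test mismatch concrete, here is a small sketch that generates over-specified training prompts and the simpler test prompts from the same facts. It only mirrors the two templates quoted in the example; the placeholder names, the birth-year range, and the helper functions are hypothetical and are not the actual Macondo dataset generator.

```python
import random


def make_example(villager, child, birth_year, over_specified):
    """Return a (prompt, target) pair for one parent-child fact."""
    if over_specified:
        # Training-style prompt: includes the causally irrelevant birth year.
        prompt = f"{villager}, who was born in {birth_year}, is the parent of"
    else:
        # Test-style prompt: the redundant clause is dropped.
        prompt = f"{villager} is the parent of"
    return prompt, child


def build_splits(facts, seed=0):
    """Build over-specified training pairs and simplified test pairs."""
    rng = random.Random(seed)
    train, test = [], []
    for villager, child in facts:
        year = rng.randint(1950, 2000)  # hypothetical range
        train.append(make_example(villager, child, year, over_specified=True))
        test.append(make_example(villager, child, year, over_specified=False))
    return train, test


# Placeholder names, for illustration only.
facts = [("Villager_0", "Child_0"), ("Villager_1", "Child_1")]
train, test = build_splits(facts)
for (train_prompt, target), (test_prompt, _) in zip(train, test):
    print(f"train: {train_prompt} -> {target}")
    print(f"test:  {test_prompt} -> {target}")
```

The target is identical in both splits; only the redundant clause differs, so a model that ties its prediction to the birth year has nothing to condition on at test time.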
Key Novelty
Over-specification Hypothesis & Macondo Dataset
  • Disproves the 'softmax bottleneck' hypothesis by showing that linear projections of the last layer can approximate kNN-LM distributions well
  • Identifies 'over-specification' (redundant non-causal info in prompts) as a key cause of LM generalization failure
  • Proposes replacing the memory-intensive kNN datastore with a trained MLP that maps context representations directly to target tokens, retaining generalization benefits with far less storage
Evaluation Highlights
  • kNN-LM achieves significantly lower perplexity than vanilla GPT-2 XL on the Macondo dataset (a synthetic over-specification task) and comes closer to the theoretical lower bound
  • Proposed MLP augmentation matches kNN-LM generalization on Macondo while using >25x less storage
  • On WikiText, MLP augmentation reduces perplexity by 1.45 points compared to the vanilla LM while using less than 4% of the kNN datastore size
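The storage savings reported above follow from a structural difference: a kNN datastore typically keeps one key vector and one target token per training token, so it grows with the corpus, whereas an MLP has a fixed parameter count. The sketch below is back-of-the-envelope arithmetic under assumed dimensions. The corpus size, hidden width, and single-hidden-layer shape are hypothetical and do not reproduce the paper's configuration or its reported ratios.

```python
def datastore_entries(num_tokens, key_dim):
    """Numbers stored by a datastore: one key vector plus one token id per token."""
    return num_tokens * (key_dim + 1)


def mlp_params(in_dim, hidden_dim, vocab_size):
    """Parameter count of a one-hidden-layer MLP with biases (assumed shape)."""
    first_layer = in_dim * hidden_dim + hidden_dim
    output_layer = hidden_dim * vocab_size + vocab_size
    return first_layer + output_layer


# Illustrative values only; not taken from the paper.
num_tokens = 1_000_000   # hypothetical corpus size
key_dim = 1600           # GPT-2 XL hidden size
hidden_dim = 1024        # hypothetical MLP width
vocab_size = 50257       # GPT-2 vocabulary size

store = datastore_entries(num_tokens, key_dim)
mlp = mlp_params(key_dim, hidden_dim, vocab_size)
print(f"datastore entries: {store:,}")
print(f"MLP parameters:    {mlp:,}")
print(f"ratio:             {store / mlp:.1f}x")
```

Under these assumptions the datastore term scales linearly with `num_tokens` while the MLP term stays constant, so the ratio widens as the corpus grows.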
Breakthrough Assessment
7/10
Provides strong negative results for the popular 'softmax bottleneck' theory and identifies a fundamental 'over-specification' failure mode in LMs. The proposed MLP solution is a practical efficiency improvement.