MassiveDS: A 1.4 trillion-token open-source datastore constructed from diverse web and domain-specific sources (books, code, papers) for retrieval scaling research
RIC-LM: Retrieve-in-Context Language Model, a model that augments generation by prepending retrieved documents to the input context, with no architectural modification
compute-optimal scaling: Analysis of how best to allocate a fixed computational budget (FLOPs) among model size, pretraining data, and datastore size to maximize performance
Contriever: A dense retrieval model trained using contrastive learning to match queries with relevant documents
perplexity: A measure of how well a language model predicts a sample of text, computed as the exponential of the average negative log-likelihood per token; lower values indicate better prediction
FLOPs: Floating-point operations, a count of the arithmetic operations a computation requires; used here as the measure of computational cost when comparing training with datastore indexing
subsampling: The process of randomly selecting a fraction of the full datastore to simulate smaller datastore sizes for scaling analysis
reranking: A second stage in retrieval where a more expensive model re-scores the initial set of retrieved documents to improve relevance
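
Two of the terms above, perplexity and subsampling, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to build or evaluate MassiveDS: the function names `perplexity` and `subsample`, the `seed` parameter, and the choice of uniform sampling without replacement are all hypothetical, and the log-probabilities are assumed to be natural logs.

```python
import math
import random


def perplexity(token_log_probs):
    """Return exp of the average negative log-probability per token.

    `token_log_probs` holds one natural-log probability per token,
    as assigned by some language model (assumed input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def subsample(datastore, fraction, seed=0):
    """Return a uniform random subset holding `fraction` of the documents.

    Sampling is without replacement; the subset size is rounded to the
    nearest whole document. The seed makes the draw reproducible.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    k = round(fraction * len(datastore))
    return rng.sample(list(datastore), k)


if __name__ == "__main__":
    # A model that gives every token probability 1/4 has perplexity of about 4.
    print(perplexity([math.log(0.25)] * 8))

    # Simulate a datastore one tenth the size of a toy 1,000-document store.
    docs = [f"doc-{i}" for i in range(1000)]
    print(len(subsample(docs, 0.1)))
```

Fixing the seed means repeated draws at the same fraction return the same subset, which keeps comparisons across simulated datastore sizes repeatable.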