ROOTS: The 1.6TB multilingual text corpus used to train the BLOOM large language model
BLOOM: A 176B-parameter open-access multilingual language model developed by the BigScience workshop
PII: Personally Identifiable Information—sensitive data like names, emails, or phone numbers that must be protected
BM25: Best Matching 25—a ranking function used by search engines to estimate the relevance of documents to a given search query
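A minimal, self-contained Python sketch of how BM25 scores documents against a query (this uses the common Okapi BM25 formula with illustrative defaults k1=1.5, b=0.75; it is not the production implementation used by Pyserini):

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    N = len(documents)
    avgdl = sum(len(d) for d in documents) / N
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for d in documents if t in d) for t in query_terms}
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the collection contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            numer = tf[t] * (k1 + 1)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * numer / denom
        scores.append(score)
    return scores

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick quick fox".split(),
]
print(bm25_scores(["quick", "fox"], docs))
```

The length normalization (the `b` term) penalizes long documents, and term-frequency saturation (the `k1` term) keeps repeated words from dominating; the third document scores highest because it is short and matches both query terms.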
Suffix Array: A data structure that stores all suffixes of a string in sorted order, enabling extremely fast exact string matching over large texts
Fuzzy Search: Search that finds matches even if the query words appear in different orders or with slight variations (implemented here via BM25)
Exact Search: Search that finds only precise, character-for-character matches of the query string
Corpus Linguistics: The study of language as expressed in corpora (samples of 'real world' text)
OSCAR: Open Super-large Crawled ALMAnaCH coRpus—a large-scale multilingual dataset obtained by language classification and filtering of Common Crawl, used as a sub-component of ROOTS
Pyserini: A Python toolkit for reproducible information retrieval research, used here to build the sparse BM25 indices