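MinHash: A technique for estimating the Jaccard similarity of two sets (such as the shingle sets of two text documents): each set's elements are hashed under many hash functions, only the minimum hash value per function is kept, and the fraction of matching minimums estimates the similarity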
LSH: Locality-Sensitive Hashing, a family of hashing techniques that place similar input items into the same 'buckets' with high probability, used for efficient approximate nearest-neighbor search
MixMinMatch: The authors' proposed 3-stage pipeline: Mix corpora, MinHash deduplicate, Match based on cross-source counts
Shingles: Short, overlapping sequences of characters (n-grams) used to represent a document for similarity comparison
Jaccard similarity: A measure of overlap between two sets, defined as the size of their intersection divided by the size of their union; it ranges from 0 (disjoint sets) to 1 (identical sets)
nanotron: A library for training Large Language Models (LLMs) efficiently
C4: Colossal Clean Crawled Corpus—a massive dataset of web text used for training language models
HPLT: High Performance Language Technologies project dataset—a large multilingual web corpus
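
To make the Shingles, Jaccard similarity, MinHash, and LSH entries concrete, here is a minimal Python sketch of how they fit together. It is an illustration only and not the authors' pipeline: the function names, the shingle length of 5, the 128 hash functions, the 32 bands, and the use of seeded BLAKE2b hashes are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of overlapping character n-grams of `text`."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(item, seed):
    """64-bit hash of `item` under hash function number `seed` (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value of the set under each of `num_hashes` functions."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signature, bands=32):
    """Split a signature into bands; each (band index, band values) pair is a bucket key.

    Two documents become a candidate pair if they share at least one bucket key.
    """
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumped over the lazy dogs"
    sa, sb = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(sa), minhash_signature(sb)
    print("exact Jaccard:   ", round(jaccard(sa, sb), 3))
    print("MinHash estimate:", round(estimate_jaccard(sig_a, sig_b), 3))
    print("candidate pair:  ", bool(lsh_buckets(sig_a) & lsh_buckets(sig_b)))
```

The estimate approaches the exact value as the number of hash functions grows, and the banding step avoids comparing every pair of documents directly; production systems typically rely on an optimized library rather than pure Python like this.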