SBP: Synthetic Bootstrapped Pretraining—the proposed method of training a synthesizer on document pairs to generate new pretraining data
Inter-document correlation: Semantic or structural relationships between separate documents (e.g., a book and its screenplay) often ignored by standard pretraining
Synthesizer-tuning: Training a language model to maximize the conditional probability of a target document given a related source document
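The objective can be sketched as a standard causal-LM setup in which the loss is masked over the source prefix, so only the target document contributes gradient. This is a minimal illustration, not the paper's exact implementation; the token ids, separator, and the -100 ignore-label convention are assumptions.

```python
def make_synthesizer_example(source_ids, target_ids, sep_id=0):
    """Build one training example for synthesizer-tuning: the model
    reads [source, SEP, target], but the loss covers only the target
    tokens, so training maximizes p(target | source).
    sep_id and the -100 ignore value are placeholder conventions
    (e.g., Hugging Face losses skip label -100)."""
    input_ids = list(source_ids) + [sep_id] + list(target_ids)
    labels = [-100] * (len(source_ids) + 1) + list(target_ids)
    return input_ids, labels
```

Feeding `input_ids` and `labels` to any causal-LM cross-entropy loss then trains the synthesizer to generate the target conditioned on its paired source.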
ANN: Approximate Nearest Neighbor—a family of efficient search algorithms for finding similar vectors in high-dimensional space, used here to pair semantically related documents
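The pairing step can be sketched with exact cosine-similarity search; at pretraining scale an ANN index (e.g., FAISS) would replace the O(n²) similarity matrix. The similarity threshold and embedding layout below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def pair_documents(embeddings, threshold=0.75):
    """Pair each document with its most similar neighbor by cosine
    similarity, keeping only pairs above a threshold. Exact search is
    used for clarity; production systems would use an ANN library."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    pairs = []
    for i in range(len(x)):
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold:
            pairs.append((i, j))
    return pairs
```

Each resulting (source, target) index pair can then serve as one synthesizer-tuning example.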
Oracle baseline: A hypothetical upper-bound model trained with access to significantly more (e.g., 20x) unique real data than the constrained setup
Repetition baseline: A standard baseline in data-constrained settings where the model is simply trained on the same data for multiple epochs
DCLM: DataComp for Language Models—a dataset collection used as the source for pretraining documents
QK-norm: Query-Key Normalization—a stability technique in Transformer attention layers applied to the query and key vectors
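QK-norm can be sketched as normalizing each query and key vector before the attention logits are computed, which bounds the logits and stabilizes training. This minimal version uses plain L2 normalization; real implementations often use RMSNorm with learnable scales, so treat the details below as assumptions.

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Single-head scaled dot-product attention with QK-norm:
    L2-normalize queries and keys along the feature dimension, so the
    attention logits become cosine similarities bounded in [-1, 1].
    Assumed layout: q, k, v are [seq_len, head_dim] arrays."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = q @ k.T
    # Numerically stable softmax over the key axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the logits are bounded, attention entropy cannot collapse from unboundedly growing query/key magnitudes, which is the stability benefit the term refers to.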