Infini-gram: A search engine using suffix arrays to compute n-gram frequencies over massive corpora (e.g., 4 trillion tokens) with millisecond latency
EM: Exact Match—a metric measuring if the generated answer string exactly matches the ground truth
Dynamic RAG: RAG systems that adaptively decide when to retrieve during generation, rather than always retrieving once at the start
Hallucination: Generated content that is factually incorrect or unfaithful to the source, often produced with high confidence
Co-occurrence: The frequency with which two entities appear together within a specific window (e.g., a document) in the pre-training corpus
SFT: Supervised Fine-Tuning—training a model on labeled examples to follow instructions
BM25: Best Matching 25—a probabilistic information retrieval function that ranks documents based on the query terms appearing in each document
Zero co-occurrence: When two entities never appear together in the same context window in the entire training corpus, strongly suggesting the model has no evidence connecting them
OLMo-2: Open Language Model 2—a fully open-source LLM family where the pre-training data is publicly available, allowing direct statistical analysis
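
To make the Co-occurrence, Zero co-occurrence, and EM entries concrete, here is a minimal Python sketch. It assumes a document-level window, plain case-insensitive substring matching for entity mentions, and whitespace and case normalization before comparing answers. These choices and the function names are illustrative assumptions, not the method behind any of the tools defined above. A corpus-scale system would answer the same counting question from an index such as a suffix array rather than a linear scan.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simple normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, truth: str) -> bool:
    """EM: True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(truth)


def cooccurrence_count(docs: Iterable[str], entity_a: str, entity_b: str) -> int:
    """Count documents that mention both entities (document-level window)."""
    a, b = normalize(entity_a), normalize(entity_b)
    count = 0
    for doc in docs:
        text = normalize(doc)
        if a in text and b in text:
            count += 1
    return count


def has_zero_cooccurrence(docs: Iterable[str], entity_a: str, entity_b: str) -> bool:
    """True when no single document mentions both entities."""
    return cooccurrence_count(docs, entity_a, entity_b) == 0
```

For example, on a toy corpus where "Marie Curie" and "Warsaw" share one document but "Marie Curie" and "Poland" never share one, `cooccurrence_count` returns 1 for the first pair and `has_zero_cooccurrence` returns True for the second.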