OLMo: Open Language Model, a decoder-only LLM whose checkpoints and training data are openly released
Dolma: Data for OLMo's pretraining, a large English-centric corpus
KLAR: A multilingual factual knowledge probing dataset containing facts grouped into relation categories
co-occurrence frequency: The number of documents in the pretraining corpus where a fact's subject and object appear together
crosslingual consistency: A metric measuring whether a model answers a factual query correctly in a target language given that it answers correctly in a reference language (usually English)
Latin script: The writing system used by English, French, Spanish, etc., distinct from scripts like Arabic or Cyrillic
crosslingual transfer: The ability of a model to apply knowledge learned in one language (usually English) to perform tasks in another language
Pearson correlation: A statistic measuring the linear correlation between two variables (here, fact frequency and recall accuracy)
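
The three quantitative terms above (co-occurrence frequency, crosslingual consistency, Pearson correlation) can be illustrated with a minimal sketch. All function names and the toy data are hypothetical. Case-insensitive substring matching stands in for whatever entity matching was actually used on the corpus, and consistency is computed as a simple conditional accuracy, which may differ from the exact formulation in the source.

```python
from math import sqrt

def cooccurrence_frequency(docs, subject, obj):
    """Count documents that mention both the subject and the object.

    Assumption: plain case-insensitive substring matching.
    """
    s, o = subject.lower(), obj.lower()
    return sum(1 for d in docs if s in d.lower() and o in d.lower())

def crosslingual_consistency(ref_correct, tgt_correct):
    """Share of facts answered correctly in the target language,
    among the facts answered correctly in the reference language."""
    hits = [t for r, t in zip(ref_correct, tgt_correct) if r]
    return sum(hits) / len(hits) if hits else float("nan")

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Toy data, invented purely for illustration.
docs = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "Paris hosted the Olympics.",
]
freq = cooccurrence_frequency(docs, "France", "Paris")

ref_correct = [True, True, False, True]   # per-fact correctness in the reference language
tgt_correct = [True, False, True, True]   # per-fact correctness in the target language
consistency = crosslingual_consistency(ref_correct, tgt_correct)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear toy series
```

Only facts the model gets right in the reference language enter the consistency denominator, so the third toy fact (wrong in the reference language, right in the target) does not count toward the score.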