LLM: Large Language Model—a deep learning model trained on vast amounts of text to generate human-like language
OCR: Optical Character Recognition—technology used to convert images of text (like scanned books) into machine-readable text formats
OSI: Open Source Initiative—an organization that defines standards for what constitutes 'open source' software and AI
SFT: Supervised Fine-Tuning—training a model on labeled examples (instructions and answers) to teach it how to follow user commands
RLHF: Reinforcement Learning from Human Feedback—a method to align models with human preferences using reward signals
DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without a separate reward model
perplexity: A measurement of how well a probability model predicts a sample; lower perplexity indicates the model is less 'surprised' by the text
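As a minimal sketch of the definition above: given the probabilities a model assigned to each observed token, perplexity is the exponential of the average negative log-probability (names here are illustrative, not from any particular library):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability the model
    assigned to each token it was asked to predict."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that spreads probability uniformly over 4 choices at every
# step is "as surprised as" a 4-way coin flip: perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # → 4.0
```

A perfectly confident model (probability 1.0 on every observed token) has perplexity 1, the minimum possible.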
causal language model: A model trained to predict the next token in a sequence based only on previous tokens
foundation model: A large-scale model trained on broad data that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks
CCNet: A pipeline for extracting high-quality monolingual datasets from web crawl data, often used to filter low-quality text
MinHash: A technique used for estimating the similarity between two sets, commonly used for deduplicating large datasets
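A minimal sketch of the idea: for each of k salted hash functions, keep the smallest hash value over the set's elements; the fraction of positions where two signatures agree estimates the Jaccard similarity of the sets. All names below are illustrative, and md5 is used only as a convenient deterministic hash, not for security:

```python
import hashlib

def _hashed(salt, item):
    # Deterministic 64-bit hash of (salt, item); the salt plays the
    # role of choosing one hash function out of a family.
    data = f"{salt}:{item}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=128):
    """Signature = for each salted hash function, the minimum hash
    value over all items in the set."""
    return [min(_hashed(salt, item) for item in items)
            for salt in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of signature positions that agree is an unbiased
    estimate of the Jaccard similarity of the underlying sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox sleeps near the lazy dog".split())
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
# True Jaccard here is 6 shared words / 10 distinct words = 0.6;
# the 128-hash estimate lands close to it.
print(estimated_jaccard(sig_a, sig_b))
```

In deduplication pipelines, documents whose signatures agree above some threshold are treated as near-duplicates, so the full pairwise set comparison never has to be computed.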