SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset of labeled examples to adapt it to a downstream task
Backtranslation: In this context, generating a question that would result in the source document content as the answer, effectively reversing the generation process to create training data
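Backtranslation can be sketched as building a "reversed" prompt and pairing the generated question with the original document as a synthetic training example. A minimal sketch; the prompt template and the `fake_generate` stub are illustrative assumptions, with the stub standing in for a real LLM call:

```python
def backtranslation_prompt(document: str) -> str:
    """Build a prompt asking a model to invent the question the passage answers.

    The exact template wording here is an assumption, not a fixed recipe.
    """
    return (
        "Below is a passage. Write the question to which this passage "
        f"is the answer.\n\nPassage:\n{document}\n\nQuestion:"
    )


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real model."""
    return "What does supervised fine-tuning adapt a model to?"


doc = "SFT adapts a pre-trained model to a downstream task using labeled examples."
question = fake_generate(backtranslation_prompt(doc))
# The synthetic (question, answer) pair becomes SFT training data.
pair = {"question": question, "answer": doc}
```

In practice the generated questions are usually filtered (e.g. by checking the model can answer them) before being used for training.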
Knowledge Distillation: Transferring knowledge from a large, capable 'teacher' model (e.g., Llama-3-70B) to a smaller 'student' model (e.g., Llama-3.1-8B)
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before producing the final answer
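In its zero-shot form, chain-of-thought can be triggered with a single cue appended to the question; the template below is one common phrasing, not the only option:

```python
def cot_prompt(question: str) -> str:
    """Zero-shot chain-of-thought: cue the model to reason before answering."""
    return f"Q: {question}\nA: Let's think step by step."


print(cot_prompt("If a train travels 60 km in 40 minutes, what is its speed in km/h?"))
# Q: If a train travels 60 km in 40 minutes, what is its speed in km/h?
# A: Let's think step by step.
```

Few-shot variants instead prepend worked examples whose answers include explicit reasoning steps.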
RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness checks (like matching a reference answer) to guide model training
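The "verifiable reward" in RLVR can be as simple as a binary exact-match check against a reference answer. A minimal sketch; the `Answer:` marker convention and exact string matching are illustrative assumptions (real pipelines often use more tolerant matchers, e.g. numeric or symbolic equivalence):

```python
def extract_answer(completion: str) -> str:
    """Pull the final answer from a completion ending in 'Answer: ...' (assumed format)."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    return completion[idx + len(marker):].strip() if idx != -1 else completion.strip()


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference exactly."""
    return 1.0 if extract_answer(completion) == reference.strip() else 0.0


print(verifiable_reward("The sum is 6 * 7. Answer: 42", "42"))  # 1.0
print(verifiable_reward("Answer: 41", "42"))                    # 0.0
```

Because the reward is an objective correctness check rather than a learned preference model, it cannot be gamed the way a reward model can.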
GPQA: A challenging QA benchmark requiring graduate-level reasoning in biology, physics, and chemistry
MMLU-Pro: An enhanced version of the MMLU benchmark designed to be more difficult and robust, covering diverse subjects
Greedy decoding: A generation strategy where the model always selects the highest-probability token at each step, making the output deterministic for a given prompt
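The greedy loop reduces to taking the argmax over next-token logits until an end-of-sequence token appears. A minimal sketch with a toy logits function standing in for a real model:

```python
def greedy_decode(logits_fn, prompt_ids, max_new_tokens, eos_id):
    """At each step, append the argmax token from the model's next-token logits."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)  # one score per vocabulary token
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


# Toy 3-token vocabulary: prefers token 2 at first, then EOS (id 0).
def toy_logits(ids):
    return [3.0, 0.0, 1.0] if len(ids) >= 4 else [0.0, 1.0, 2.0]


print(greedy_decode(toy_logits, [5], max_new_tokens=10, eos_id=0))
# [5, 2, 2, 2, 0]
```

Sampling strategies (temperature, top-p) replace the `max` with a draw from the softmaxed distribution, trading determinism for diversity.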
Deduplication: Removing duplicate or near-duplicate entries from a dataset to prevent overfitting and ensure diversity
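Exact and trivially near-duplicate entries can be removed by hashing a normalized form of each example. A minimal sketch; production pipelines typically add fuzzier near-duplicate detection (e.g. MinHash or embedding similarity), which is beyond this illustration:

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(examples):
    """Keep the first occurrence of each normalized example."""
    seen, unique = set(), []
    for ex in examples:
        digest = hashlib.sha256(normalize(ex).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique


data = ["What is SFT?", "what  is SFT?", "Define RLVR."]
print(deduplicate(data))  # ['What is SFT?', 'Define RLVR.']
```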
Self-training: A method where a model generates its own training data or rewards to improve its performance without external human labels