SFT: Supervised Fine-Tuning—training a model on labeled instruction-response pairs to teach it specific formats or tasks
RLHF: Reinforcement Learning from Human Feedback—a technique to align models using human preferences
GRPO: Group Relative Policy Optimization—a reinforcement learning method that optimizes policies using relative advantages within a group of sampled responses
Data contamination: Direct or near-duplicate overlap between benchmark evaluation examples and the corpora used during model training
GSM8K: A popular dataset of grade-school math word problems used to evaluate LLM reasoning
MBPP: Mostly Basic Python Problems—a benchmark for evaluating basic Python coding capabilities
GSMPlus: A math benchmark created from GSM8K via adversarial edits to its problems; treated as uncontaminated and used to measure generalization beyond memorized test items
HumanEval: A benchmark of hand-written Python programming problems, used as an uncontaminated counterpart to MBPP
Base model: An LLM that has only undergone pre-training on raw text, without task-specific fine-tuning or alignment
Over-estimation: An inflated benchmark score that results from a model memorizing leaked test data rather than possessing the underlying capability
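
The "relative advantages within a group" in the GRPO entry can be illustrated with a minimal sketch. It assumes the common formulation in which each sampled response's reward is normalized by the mean and standard deviation of its own group. The function name group_relative_advantages, the eps parameter, the use of the population standard deviation, and the sample rewards are all illustrative assumptions, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to the other responses in the same group.

    Hypothetical helper: normalizes by the group's mean and population
    standard deviation. eps (an assumed constant) avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses sampled for one prompt
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive positive advantages and the rest receive negative ones, so no separately learned value baseline is needed in this sketch.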