rubric-agnostic: The ability of a model to evaluate responses based on any provided set of criteria (rubric) rather than being hard-coded to a specific metric like helpfulness
point-wise evaluation: Assessing a single response in isolation and assigning it a score (e.g., 1-5)
pair-wise evaluation: Comparing two responses to the same prompt and selecting the better one
binary evaluation: Making a definitive yes/no judgment on a response (e.g., factually correct vs. incorrect)
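The three evaluation formats above can be sketched as toy judge functions. This is an illustrative assumption, not the source's implementation: a real system would query an LLM judge, and here a trivial length-based heuristic stands in for the model.

```python
# Toy stand-ins for an LLM judge, one per evaluation format.

def pointwise(response: str) -> int:
    """Point-wise: score a single response in isolation on a 1-5 scale."""
    # toy heuristic: more words -> higher score, capped at 5
    return min(5, 1 + len(response.split()) // 5)

def pairwise(response_a: str, response_b: str) -> str:
    """Pair-wise: compare two responses to the same prompt, pick the better."""
    return "A" if pointwise(response_a) >= pointwise(response_b) else "B"

def binary(response: str, threshold: int = 3) -> bool:
    """Binary: a yes/no judgment, e.g. 'is this response acceptable?'."""
    return pointwise(response) >= threshold
```

The same underlying scorer can back all three formats, which is why a single rubric-agnostic model can serve point-wise, pair-wise, and binary use cases.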
distillation: Training a smaller student model to mimic the outputs (reasoning traces) of a larger, stronger teacher model
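When distillation means fine-tuning on teacher-generated reasoning traces, the loss reduces to ordinary cross-entropy on the teacher's tokens. A minimal sketch, with hypothetical per-token probabilities standing in for a real student model:

```python
import math

def distill_nll(student_probs_of_teacher_tokens):
    """Negative log-likelihood of the teacher's trace under the student.

    student_probs_of_teacher_tokens[i] is the (hypothetical) probability the
    student assigns to the teacher's i-th token given the preceding tokens.
    """
    return -sum(math.log(p) for p in student_probs_of_teacher_tokens)
```

A student that assigns high probability to the teacher's trace incurs low loss, so minimizing this objective pushes the student to reproduce the teacher's reasoning style.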
DeepSeek-R1: A strong open-weights reasoning model used here as a teacher to generate synthetic reasoning traces
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small subset of model weights
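The low-rank idea behind LoRA can be shown on a single linear layer. A simplified numpy sketch (assumed dimensions, no training loop): the pretrained weight W stays frozen and only the factors A and B of a rank-r update are trainable.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 2            # r << d_in, d_out

W = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, r))              # trainable up-projection, zero-initialized

def lora_forward(x, alpha=1.0):
    # output = W x + alpha * B A x; with B = 0 at init, this exactly
    # matches the base model, so training starts from the pretrained behavior
    return W @ x + alpha * (B @ (A @ x))

x = rng.normal(size=d_in)
```

Only r * (d_in + d_out) = 64 parameters are trained here versus 256 for the full matrix, which is the source of LoRA's efficiency.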
Best-of-N: An inference strategy where the model generates N candidate responses and a reward model selects the best one
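Best-of-N is straightforward to sketch. The `generate` and `reward` functions below are hypothetical stand-ins for a language-model sampler and a reward model; only the selection logic is the point.

```python
import random

def generate(prompt: str, seed: int) -> str:
    # stand-in for sampling a candidate response from a language model
    random.seed(seed)
    return f"{prompt} -> candidate {random.randint(0, 100)}"

def reward(response: str) -> float:
    # stand-in for a learned reward model's score
    return float(len(response))

def best_of_n(prompt: str, n: int = 4) -> str:
    """Sample n candidates and return the one the reward model scores highest."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=reward)
```

Quality improves with larger N at the cost of N forward passes, so N trades compute for response quality at inference time.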
DPO: Direct Preference Optimization—a method to align language models to preferences without training an explicit reward model, though a reward model is often still used to curate the preference data
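The DPO objective on a single preference pair can be written directly from log-probabilities. In this sketch the log-probs are given as plain floats rather than computed from actual policy and reference models:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin).

    The margin is how much more the policy prefers the chosen response over
    the rejected one, relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; the loss shrinks as the policy separates chosen from rejected responses, which is the implicit reward signal DPO optimizes.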