LLM judge: An LLM prompted to evaluate and rank the quality of outputs from other models
RFT: Reinforcement Fine-Tuning—using reinforcement learning to optimize a model against a reward signal
Rubric: A specific, measurable criterion (e.g., 'Is the code efficient?') used to score a response
Whitening: A transformation that decorrelates variables (here, rubric scores) and rescales them to unit variance, removing redundancy between correlated criteria
RLVR: Reinforcement Learning from Verifiable Rewards—RL where the reward is objectively checkable (e.g., code compiles), contrasted here with open-ended tasks
Variance proxy: The parameter that upper-bounds how spread out a random variable's tails can be (e.g., the σ² in a sub-Gaussian bound), used here to derive upper bounds on misclassification probability
Sub-Gaussian: A property of a probability distribution that decays at least as fast as a Gaussian, implying tightly bounded noise
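The last two entries can be stated together in one standard definition. A random variable $X$ with mean $\mu$ is sub-Gaussian with variance proxy $\sigma^2$ if

$$\mathbb{E}\left[e^{\lambda (X - \mu)}\right] \le e^{\lambda^2 \sigma^2 / 2} \quad \text{for all } \lambda \in \mathbb{R},$$

which implies the Gaussian-like tail bound

$$\Pr\big(|X - \mu| \ge t\big) \le 2\, e^{-t^2 / (2\sigma^2)}.$$

Applied to a noisy score, such a tail bound directly upper-bounds the probability that noise flips a comparison (misclassification), with the variance proxy $\sigma^2$ controlling how fast that probability decays.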
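To make the whitening entry concrete, here is a minimal sketch of ZCA whitening applied to a matrix of per-response rubric scores. The function name, matrix shapes, and the synthetic "redundant rubric" are illustrative assumptions, not part of the source; the method (eigendecomposition of the empirical covariance, then multiplication by its inverse square root) is standard.

```python
import numpy as np

def whiten(scores: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """ZCA-whiten a (n_responses, n_rubrics) score matrix:
    center the columns, then decorrelate so the empirical
    covariance of the result is approximately the identity."""
    centered = scores - scores.mean(axis=0)
    cov = centered.T @ centered / (len(scores) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Inverse square root of the covariance; eps guards
    # against near-zero eigenvalues from redundant rubrics.
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ w

# Hypothetical example: 500 responses scored on 4 rubrics,
# where rubric 3 nearly duplicates rubric 0.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 4))
raw[:, 3] = raw[:, 0] + 0.1 * rng.normal(size=500)
white = whiten(raw)
print(np.round(np.cov(white, rowvar=False), 2))  # approximately the identity
```

After whitening, the near-duplicate rubric no longer double-counts in any aggregate score, since the transformed coordinates are uncorrelated with equal variance.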