RaR: Rubrics-as-Rewards—using structured criteria (rubrics) instead of scalar scores to evaluate model responses
CRG: Contrastive Rubric Generation—a method to generate rubrics by asking an LLM to identify why one response was preferred over another
Hard Rules: Explicit, objective constraints specified in a user prompt (e.g., 'no more than 5 sentences')
Principles: Implicit, generalizable qualities of good responses (e.g., 'reasoning soundness', 'polite tone')
RLVR: Reinforcement Learning with Verifiable Rewards—alignment using objective success criteria (like math answers or code execution)
GenRM: Generative Reward Model—a reward model that outputs text (like reasoning chains) before a score, rather than just a scalar
SFT: Supervised Fine-Tuning—training a model on labeled input-output examples
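To make the RaR idea concrete, here is a minimal sketch of rubric-based reward scoring. The rubric, criterion names, and weights are hypothetical, and a trivial string check stands in for what would actually be an LLM judge evaluating each criterion; the reward is the weighted fraction of criteria satisfied, rather than a single opaque scalar.

```python
# Hypothetical sketch of Rubrics-as-Rewards scoring. A real system would
# call an LLM judge per criterion; here a keyword/length check stands in.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    description: str               # a Hard Rule or a Principle
    weight: float                  # relative importance in the rubric
    check: Callable[[str], bool]   # stand-in for an LLM judge call

def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of satisfied criteria, normalized to [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total if total else 0.0

# Illustrative rubric mixing a Hard Rule and a Principle (both hypothetical).
rubric = [
    Criterion("No more than 5 sentences (Hard Rule)", 2.0,
              lambda r: r.count(".") <= 5),
    Criterion("Polite tone (Principle)", 1.0,
              lambda r: "please" in r.lower() or "thank" in r.lower()),
]

print(rubric_reward("Thank you for asking. Here is the answer.", rubric))
```

Because each criterion contributes a named, weighted component, the resulting reward is interpretable: a low score can be traced back to the specific Hard Rule or Principle that failed, unlike a scalar from a conventional reward model.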