RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences
Reward Hacking: When an RL agent exploits flaws in the reward model to get high scores without actually satisfying the user's intent
Mixture of Judges (MoJ): A set of evaluators (rule-based or model-based) that check specific constraints (e.g., 'is the code valid?', 'is the response safe?')
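To make the MoJ idea concrete, here is a minimal sketch of two rule-based judges and a gate that passes a response only if every constraint holds. The judge functions and names are illustrative assumptions, not APIs from the source.

```python
import ast

def code_judge(response: str) -> bool:
    # Rule-based judge (illustrative): does the response parse as valid Python?
    try:
        ast.parse(response)
        return True
    except SyntaxError:
        return False

def length_judge(response: str, max_chars: int = 2000) -> bool:
    # Rule-based judge (illustrative): is the response within a length budget?
    return len(response) <= max_chars

def mixture_of_judges(response: str, judges) -> bool:
    # A generation satisfies the constraints only if every judge approves it.
    return all(judge(response) for judge in judges)
```

A model-based judge would have the same boolean interface, with the check delegated to a classifier or an LLM prompt instead of a rule.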
PPO: Proximal Policy Optimization—a standard RL algorithm that limits how much the policy changes in each step
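The "limits how much the policy changes" part of PPO is usually implemented with a clipped surrogate objective. A single-sample sketch (eps=0.2 is the commonly used default, not a value from this glossary):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    # Probability ratio between the new and old policy for one action.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps]; taking the min makes the
    # objective pessimistic, so large policy moves gain nothing extra.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

For example, if the new policy doubles an action's probability (ratio = 2) with advantage +1, the objective is capped at 1.2 rather than 2, removing the incentive to move further in one step.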
DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly from preference data without an explicit reward model loop
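The DPO loss on a single preference pair can be sketched in a few lines: it is the negative log-sigmoid of a beta-scaled implicit reward margin between the chosen and rejected responses (log-probabilities here are illustrative scalars; beta=0.1 is a common choice, not a value from this glossary):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid: the loss shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; increasing the chosen response's log-probability lowers it.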
CRPG: Calibrated-Regularized Policy Gradient—a proposed policy-gradient optimizer that handles constraints through reward calibration and regularization

CRRAFT: Calibrated-Regularized Reward Ranking Finetuning—a proposed optimizer based on filtering and ranking samples
CODPO: Constrained Online Direct Preference Optimization—a proposed constrained version of online DPO
Pareto optimal: A state where no objective can be improved without worsening another (e.g., improving safety without hurting helpfulness)
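Pareto optimality over score tuples can be made concrete with a dominance check: a point is on the Pareto front if no other point beats it on one objective without losing on another. A small sketch over hypothetical (helpfulness, safety) scores, higher being better:

```python
def dominates(a, b):
    # a dominates b if it is at least as good on every objective
    # and strictly better on at least one (higher is better).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    # Keep only the points that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Here (0.4, 0.4) would be dominated by (0.5, 0.5), while (0.9, 0.2) and (0.2, 0.9) both survive: neither can improve safety without hurting helpfulness, or vice versa.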
SFT: Supervised Fine-Tuning—the initial phase of training on high-quality demonstrations
KL penalty: Kullback-Leibler divergence penalty—keeps the RL model from drifting too far from the reference model
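In practice the KL penalty is often folded into the per-token reward using the log-probability gap between the policy and the reference model as a KL estimate. A minimal sketch (the coefficient beta=0.05 is an illustrative assumption):

```python
import math

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.05):
    # Per-token KL estimate: log pi(a|s) - log pi_ref(a|s).
    # Subtracting beta * KL from the reward penalizes drift
    # away from the reference model.
    kl = logp_policy - logp_ref
    return reward - beta * kl
```

When the policy agrees with the reference (kl = 0), the reward passes through unchanged; the more the policy drifts, the more the reward is discounted.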