ArmoRM: Absolute-Rating Multi-Objective Reward Model—the proposed architecture that predicts multiple specific reward scores (helpfulness, safety, etc.) instead of one generic score
MoE: Mixture-of-Experts—a neural network architecture where different 'experts' (here, reward objectives) are weighted dynamically by a gating network
Bradley-Terry model: A statistical model used to predict the probability that one item is preferred over another based on their score difference
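The Bradley-Terry preference probability can be sketched in a few lines: the probability that A beats B is the sigmoid of their score difference. (A minimal illustration; the function name is mine, not from any particular library.)

```python
import math

def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that item A is preferred over item B,
    modeled as a logistic sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Equal scores give a 50/50 preference; a higher score for A
# pushes the probability toward 1.
print(bradley_terry_prob(1.0, 1.0))  # 0.5
print(bradley_terry_prob(2.0, 0.0) > 0.5)  # True
```

In RLHF reward-model training, this is typically the loss target: the model's scores for a chosen/rejected response pair are pushed to make the chosen one more probable under this formula.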
RLHF: Reinforcement Learning from Human Feedback—fine-tuning LLMs using a reward model trained on human preferences
verbosity bias: The tendency of reward models (and, consequently, the LLMs aligned with them) to prefer longer responses regardless of their actual quality
RewardBench: A benchmark designed to evaluate reward models on various capabilities like chat, reasoning, and safety
gating network: A small neural network that takes the input context and outputs weights (summing to 1) to combine the multi-objective reward scores
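The gating idea can be sketched as follows: a softmax over context-dependent logits yields weights summing to 1, which combine the per-objective reward scores into a single scalar. (A toy, stdlib-only sketch; in ArmoRM the logits come from a learned network over the prompt representation, not from a hand-supplied list.)

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_reward(objective_scores: list[float],
                 gating_logits: list[float]) -> float:
    """Weighted sum of multi-objective reward scores
    (e.g. helpfulness, safety), using gating weights."""
    weights = softmax(gating_logits)
    return sum(w * s for w, s in zip(weights, objective_scores))

# With uniform logits, the gated reward is just the mean score.
print(gated_reward([3.0, 6.0, 9.0], [0.0, 0.0, 0.0]))  # 6.0
```

Because the weights depend on the input context, the same model can emphasize safety for risky prompts and helpfulness for benign ones.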