SBM: Side-Branch Models—auxiliary lightweight models designed to extract specific features (facts, style, etc.) from the input before the main reward model runs
SRM: Structural Reward Model—the proposed framework integrating SBMs with a standard reward model
BoN: Best-of-N—a sampling strategy in which N candidate responses are generated for a prompt and the highest-scoring one, typically as judged by a reward model, is selected
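A minimal sketch of Best-of-N selection; the toy `reward` scorer and function names below are illustrative stand-ins, not from the paper (a real setup would query a trained reward model):

```python
def reward(response: str) -> float:
    # Toy stand-in scorer: counts distinct words.
    # A real pipeline would call a trained reward model here.
    return float(len(set(response.split())))

def best_of_n(candidates):
    # Best-of-N: score every sampled response and return the argmax.
    return max(candidates, key=reward)

samples = ["the the the", "the cat sat", "a cat sat on the mat"]
best = best_of_n(samples)  # → "a cat sat on the mat"
```

Quality improves with larger N at the cost of N forward passes through both the generator and the reward model.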
Bradley-Terry Model: A probability model for pairwise comparisons in which the chance that one item beats another depends on the difference of their latent scores; its negative log-likelihood is the standard loss for training reward models on preference pairs
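Under the Bradley-Terry model, P(chosen ≻ rejected) = σ(r_chosen − r_rejected), and the training loss is the negative log of that probability. A minimal sketch (function names are illustrative):

```python
import math

def bt_prob(r_chosen: float, r_rejected: float) -> float:
    # P(chosen beats rejected) = sigmoid of the score difference.
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    # Negative log-likelihood of the observed preference;
    # shrinks as the reward margin for the chosen response grows.
    return -math.log(bt_prob(r_chosen, r_rejected))
```

Equal scores give probability 0.5 and loss log 2; widening the margin between chosen and rejected scores drives the loss toward zero.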
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
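A minimal NumPy sketch of the LoRA update, assuming the standard formulation (frozen weight W plus a scaled low-rank product BA, with B zero-initialized so training starts from the pre-trained behavior); the dimensions and names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                    # layer dims and LoRA rank (r << min(d, k))

W = rng.normal(size=(d, k))          # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def forward(x, alpha=16.0):
    # Output = frozen path + low-rank update, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
# With B zero-initialized, the adapted layer initially matches the frozen one.
assert np.allclose(forward(x), W @ x)
```

Only A and B (2·r·d parameters here, versus d·k for W) receive gradients, which is what makes the fine-tuning parameter-efficient.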
o1: Refers to OpenAI's reasoning model, used here as a strong 'teacher' to filter training data
GRM: Generative Reward Model—a reward model that generates a text explanation or Chain-of-Thought before outputting a score