CRM: Collaborative Reward Modeling—the proposed framework where two models filter training data for each other
RLHF: Reinforcement Learning from Human Feedback—a method to align language models using human preferences
Reward Model (RM): A model trained to predict which of two responses a human would prefer
DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly from preference data without training an explicit reward model (mentioned as a CRM extension)
Reward Margin: The difference between the reward scores of the preferred and rejected responses; used here as a measure of the model's confidence in a training sample
NLL loss: Negative Log-Likelihood loss—the standard objective function for training reward models
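The two preceding entries are closely linked: under the standard Bradley–Terry formulation, the pairwise NLL loss is a function of the reward margin. A minimal sketch in plain Python (function names are illustrative, not from the paper):

```python
import math

def reward_margin(r_chosen: float, r_rejected: float) -> float:
    """Reward margin: how strongly the RM prefers the chosen response."""
    return r_chosen - r_rejected

def pairwise_nll(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry NLL for one preference pair: -log sigmoid(margin).
    Small when the margin is large and positive."""
    margin = reward_margin(r_chosen, r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A large positive margin yields a low loss (a confident, likely clean pair);
# a negative margin yields a high loss (the pair may be noisy or mislabeled).
```

This link is why the margin can double as a data-confidence signal: low-margin pairs are exactly the ones the loss treats as hard or possibly mislabeled.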
Peer Review: The mechanism where one model evaluates the training batch of another model to identify and filter noisy samples
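A hypothetical sketch of the peer-review idea, assuming the peer scores each pair and low-margin pairs are filtered as potentially noisy (the threshold rule and `peer_rm` interface are illustrative, not the paper's exact mechanism):

```python
def peer_review(batch, peer_rm, margin_threshold=0.0):
    """Keep only pairs for which the *peer* reward model also assigns
    the chosen response a margin above the threshold."""
    kept = []
    for chosen, rejected in batch:
        if peer_rm(chosen) - peer_rm(rejected) > margin_threshold:
            kept.append((chosen, rejected))
    return kept

# Toy usage: a stand-in "reward model" that scores responses by length.
batch = [("a detailed answer", "no"), ("ok", "a much longer rejected one")]
clean = peer_review(batch, peer_rm=len)
```

Because the filter is applied by the *other* model, a sample that one model has overfit to can still be flagged by its peer.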
Curriculum Learning: A training strategy where the model starts with easy examples and gradually progresses to harder ones
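Applied to reward modeling, one natural easy-to-hard ordering is by margin: high-margin (confident) pairs first, low-margin (hard) pairs later. A sketch under that assumption (the ordering rule is illustrative; the paper may define difficulty differently):

```python
def margin_curriculum(batch, rm):
    """Order preference pairs from easy (large margin under `rm`)
    to hard (small or negative margin)."""
    return sorted(batch,
                  key=lambda pair: rm(pair[0]) - rm(pair[1]),
                  reverse=True)
```

Feeding batches in this order lets the model fit clear-cut preferences before confronting ambiguous or noisy ones.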
Reward Misgeneralization: When a reward model learns spurious correlations from noisy data, failing to generalize to true human preferences