RLHF: Reinforcement Learning from Human Feedback—a method to align language models by training a reward model on human preferences and optimizing the policy against it
Reward Hacking: A phenomenon where an agent exploits loopholes in the reward function to maximize its reward without actually performing the intended task correctly
Linear Mode Connectivity: The property where two neural networks connected by a linear path in weight space have low loss along that entire path (requires shared pre-training)
SFT: Supervised Fine-Tuning—the initial phase of training a model on high-quality instruction-response pairs
Baklava: A specific diversity strategy in WARM where reward models are initialized from different checkpoints along a single SFT trajectory
KL divergence: Kullback-Leibler divergence—a statistical distance measure used here to prevent the RL policy from drifting too far from the original SFT model
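For intuition, a minimal sketch of KL divergence over discrete distributions (the RL objective applies the same quantity to the policy's and the SFT model's token distributions; this toy version just assumes two probability lists of equal length):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Terms where p_i == 0 contribute nothing; q_i is assumed nonzero
    wherever p_i > 0 (otherwise the divergence is infinite).
    """
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)
```

Identical distributions give a divergence of zero, and the value grows as the policy drifts away from the reference model, which is why it works as a drift penalty.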
Prediction Ensembling: Running multiple independent models and averaging their outputs (logits or scores) at inference time
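A minimal sketch of prediction ensembling for reward models, assuming each model is a callable `(prompt, response) -> float` (a stand-in for a real scoring API):

```python
def ensemble_score(reward_models, prompt, response):
    """Average the scores of several independent reward models at inference time.

    Each element of reward_models is assumed to be a callable
    (prompt, response) -> float; the ensemble score is the plain mean.
    """
    scores = [rm(prompt, response) for rm in reward_models]
    return sum(scores) / len(scores)
```

Note the contrast with weight averaging: here every model must be kept in memory and run at inference, whereas a weight-averaged model has the cost of a single network.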
OOD: Out-of-Distribution—data that differs significantly from the data seen during training
BoN: Best-of-N—a sampling strategy where N candidates are generated and the one with the highest reward score is selected
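Best-of-N reduces to a few lines once a generator and a reward model are given; this sketch assumes `generate` is a callable `prompt -> response` (sampled, so repeated calls differ) and `reward_model` is a callable `(prompt, response) -> float`:

```python
def best_of_n(generate, reward_model, prompt, n=8):
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))
```

Because the final answer is whichever candidate the reward model scores highest, BoN is a common stress test for reward hacking: any scoring loophole gets amplified as n grows.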
Weight Averaging: Averaging the parameters (weights) of multiple neural networks to create a single merged network
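A minimal sketch of weight averaging, assuming the models share an architecture and their parameters are stored as flat Python lists keyed by name (a toy stand-in for real tensor state dicts):

```python
def average_weights(state_dicts):
    """Merge several same-architecture models by averaging each named parameter.

    state_dicts is a list of {param_name: [float, ...]} mappings with
    identical keys and shapes; the result is a single merged mapping.
    """
    return {
        key: [sum(values) / len(state_dicts)
              for values in zip(*(sd[key] for sd in state_dicts))]
        for key in state_dicts[0]
    }
```

The merged network is then used exactly like any single model, which is what makes weight averaging cheaper at inference than prediction ensembling.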