Autoregressive RM: A reward model that assigns a reward to each next token given the preceding tokens, parameterized as a log-probability distribution; the reward of a full sequence then factorizes into a sum of per-token rewards.
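A minimal toy sketch of this factorization, assuming the autoregressive RM is parameterized like a language model (the token table and its probabilities below are hypothetical, not from the source):

```python
import math

# Hypothetical per-token rewards: log-probabilities the autoregressive RM
# assigns to each next token given the history (assumed toy values).
token_log_probs = {
    ("<s>",): {"The": math.log(0.6), "A": math.log(0.4)},
    ("<s>", "The"): {"answer": math.log(0.7), "cat": math.log(0.3)},
}

def sequence_reward(tokens):
    """Sum per-token rewards r(y_t | y_<t) = log pi_r(y_t | y_<t),
    so the trajectory-level reward factorizes over tokens."""
    history = ("<s>",)
    total = 0.0
    for tok in tokens:
        total += token_log_probs[history][tok]
        history = history + (tok,)
    return total

reward = sequence_reward(["The", "answer"])  # log(0.6) + log(0.7)
```

Because each prefix already has a well-defined partial sum, the model can score incomplete sequences token by token, unlike a trajectory-level RM.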
Trajectory-level RM: A standard reward model that assigns a scalar score only to a complete text sequence, often failing to accurately score partial sequences.
Test-time alignment: Aligning an LLM's output to preferences during inference (decoding) without updating the model's weights.
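The decoding-time combination can be sketched as follows: at each step, the frozen base model's next-token log-probabilities are shifted by a token-level reward and renormalized. The distributions and the weight `beta` below are illustrative assumptions, not values from the source:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Assumed toy next-token distributions over a 3-token vocabulary.
base_log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]    # frozen base LLM
reward_log_probs = [math.log(0.1), math.log(0.6), math.log(0.3)]  # token-level RM

def aligned_next_token_probs(base_lp, reward_lp, beta=1.0):
    """Guided decoding: p(y_t) proportional to p_base(y_t) * exp(beta * r(y_t)),
    computed per step without updating the base model's weights."""
    combined = [b + beta * r for b, r in zip(base_lp, reward_lp)]
    return softmax(combined)

probs = aligned_next_token_probs(base_log_probs, reward_log_probs)
```

With `beta = 0` the base distribution is recovered unchanged; larger `beta` pushes sampling further toward reward-preferred tokens.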
Weak-to-strong generalization: Using a smaller, weaker model (e.g., 7B RM) to supervise or guide a larger, stronger model (e.g., 70B LLM).
DPO: Direct Preference Optimization—a training-time method that fine-tunes LLMs on preference pairs without an explicit reward model.
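The DPO objective for a single preference pair can be sketched directly from its published form; the sequence log-probabilities below are assumed toy values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair (chosen y_w, rejected y_l):
    -log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]),
    where ref_* are log-probs under the frozen reference (SFT) model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

# Assumed toy sequence log-probabilities under policy and reference models.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)
```

The implicit reward is the policy-to-reference log-ratio, which is why no separate reward model needs to be trained.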
KL divergence: Kullback-Leibler divergence, a measure of how one probability distribution differs from another; used here to ensure the aligned model's output distribution does not drift too far from the base model's, preserving its capabilities.
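A minimal sketch of the discrete KL divergence, applied to two toy next-token distributions (the numbers are illustrative assumptions):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); zero iff p == q,
    and it grows as p places mass where q does not."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.5, 0.3, 0.2]     # base model's next-token distribution (toy values)
aligned = [0.4, 0.4, 0.2]  # aligned model's distribution (toy values)
drift = kl_divergence(aligned, base)
```

In alignment objectives this term is typically added as a penalty, so optimizing the reward is traded off against staying close to the base model.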
SFT: Supervised Fine-Tuning—the initial training phase of an LLM on high-quality instruction data.