RALM: Retrieval-Augmented Language Model, an LLM that retrieves external documents and conditions on them during generation
RM: Reward Model—a model trained to predict human preferences between different model outputs, used to guide RLHF
SFT: Supervised Fine-Tuning—training a model on labeled examples (instruction-response pairs) before alignment/RLHF
LLM-as-a-judge: Using strong LLMs (e.g., GPT-4) to evaluate and score the outputs of other models, acting as a proxy for human evaluation
DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without training an explicit reward model first
PPO: Proximal Policy Optimization—an RL algorithm that optimizes a policy using reward signals, often used in RLHF
BoN: Best-of-N sampling—generating N responses and using a reward model to select the highest-scoring one
Pearson correlation: A statistic measuring the linear correlation between two sets of data (here, between automated judges and human annotators)
discriminative RM: A reward model that takes a prompt and response and outputs a scalar score representing quality
generative RM: A reward model prompted to generate text (e.g., 'Response A is better') to indicate preference
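implicit RM: Using a DPO-trained policy as a reward signal without a separate reward model; the reward for a response is the scaled log-probability ratio between the DPO-trained policy and its reference model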
multi-hop reasoning: A reasoning process that answers a query by connecting pieces of information drawn from multiple documents
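
Three of the entries above (BoN, Pearson correlation, implicit RM) describe small computations, so a minimal Python sketch may help make them concrete. Everything named here is hypothetical and chosen for illustration only: `toy_score` is a stand-in for a real discriminative RM, and `implicit_reward` assumes the standard DPO formulation in which the reward is `beta` times the difference between the policy and reference log-probabilities of a response.

```python
import math
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate the scorer rates highest (first one on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit DPO reward (assumed form): beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def toy_score(response: str) -> float:
    """Hypothetical stand-in for a discriminative RM: scores by word count."""
    return float(len(response.split()))


if __name__ == "__main__":
    responses = ["Paris.", "The capital is Paris.", "It is Paris."]
    print(best_of_n(responses, toy_score))
    # Agreement between an automated judge and human annotators (made-up scores).
    print(pearson([1.0, 2.0, 3.0, 4.0], [1.5, 1.9, 3.2, 3.8]))
    print(implicit_reward(logp_policy=-12.0, logp_ref=-15.0))
```

In practice the scorer passed to `best_of_n` would be a trained reward model (discriminative, generative, or implicit), and the inputs to `pearson` would be paired judge and human ratings over the same set of responses.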