RLAIF: Reinforcement Learning from AI Feedback—using an AI model to generate preference labels instead of human annotators
Instructable Reward Model: A reward model that accepts text principles as input alongside the prompt and response, generating scores conditioned on those principles
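A minimal sketch of what "conditioned on those principles" can look like at the input level: the principles are serialized together with the prompt and response into a single reward-model input. The function name `build_rm_input` and the exact template are illustrative assumptions, not the paper's actual format.

```python
def build_rm_input(principles, prompt, response):
    """Concatenate guiding principles, the prompt, and a candidate
    response into a single input string for the reward model.
    (Hypothetical template; the paper's real format may differ.)"""
    principle_block = "\n".join(f"- {p}" for p in principles)
    return (
        f"Principles:\n{principle_block}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response: {response}"
    )

rm_input = build_rm_input(["Be concise.", "Be honest."], "What is 2+2?", "4")
# The reward model then emits a scalar score for this combined input,
# so changing the principles changes the score for the same response.
```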
Reward Hacking: When an RL agent exploits flaws in the reward function to get high scores without actually achieving the desired goal (e.g., being verbose to look helpful)
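The verbosity example in this entry can be made concrete with a toy flawed reward: if the reward function conflates length with helpfulness, a padded answer outscores a correct concise one. This is a deliberately broken illustration, not any real reward model.

```python
def flawed_reward(response: str) -> float:
    """A broken proxy reward: treats word count as helpfulness."""
    return float(len(response.split()))

concise = "Paris."
verbose = ("Well, let me think about this question at length. " * 10) + "Paris."

# The verbose answer "hacks" the reward despite adding no information.
assert flawed_reward(verbose) > flawed_reward(concise)
```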
Dromedary-2: The AI assistant developed in this paper, built by applying the SALMON method to Llama-2-70b
Synthetic Preference: Training data for the reward model generated by sampling two responses and asking an LLM to judge which is better based on a specific principle
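The pipeline in this entry—sample two responses, ask an LLM judge which better follows a principle, and keep the verdict as a (chosen, rejected) pair—can be sketched as follows. Here `judge` is a stand-in stub for the LLM call (it simply prefers the shorter response when the principle asks for concision); all names are hypothetical.

```python
def judge(principle, prompt, resp_a, resp_b):
    """Stand-in for the LLM judge. A real implementation would prompt
    an LLM with the principle and both responses; this stub prefers
    the shorter response when the principle asks for concision."""
    if "concise" in principle.lower():
        return "A" if len(resp_a) <= len(resp_b) else "B"
    return "A"

def make_preference(principle, prompt, resp_a, resp_b):
    """Turn two sampled responses plus a judge verdict into one
    (chosen, rejected) pair for reward-model training."""
    winner = judge(principle, prompt, resp_a, resp_b)
    chosen, rejected = (resp_a, resp_b) if winner == "A" else (resp_b, resp_a)
    return {"principle": principle, "prompt": prompt,
            "chosen": chosen, "rejected": rejected}

pair = make_preference(
    "The response should be concise.",
    "What is the capital of France?",
    "The capital of France, a country in Western Europe, is the city of Paris.",
    "Paris.",
)
# → pair["chosen"] == "Paris."
```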
MT-Bench: A challenging multi-turn benchmark for evaluating chat assistants using GPT-4 as a judge
SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality instruction-response pairs
PPO: Proximal Policy Optimization—an RL algorithm used to update the policy model
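The core of PPO's policy update is the per-sample clipped surrogate objective, sketched below in its standard textbook form (this is the generic formulation, not code from the paper): the probability ratio between new and old policies is clipped to [1 − ε, 1 + ε] so a single update cannot move the policy too far.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate loss:
    -min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the new/old policy probability ratio and A the advantage."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped_ratio * advantage)

# A large favorable update is capped: ratio 1.5 is clipped to 1.2,
# so the gain cannot exceed 1.2 * A.
assert abs(ppo_clip_loss(1.5, 1.0) - (-1.2)) < 1e-9
```

The clipping is asymmetric in effect: for positive advantages it caps the incentive to push probability up; for negative advantages it caps the incentive to push it down.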
LLMBar: An adversarial benchmark designed to test whether models can resist being misled by superficially appealing responses, misleading instructions, or evaluator biases