Proxy Reward Function: A function (here, an LLM) used to approximate the true, desired reward signal when the ground truth reward is hard to specify.
In-context Learning: The ability of a language model to perform a task given only a few examples in its prompt (input context), without updating its weights.
Zero-shot Prompting: Asking the model to perform a task using only a description, without providing any specific examples.
Few-shot Prompting: Providing the model with a small number (e.g., 1-10) of input-output examples to guide its behavior.
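The contrast between the two prompting styles can be sketched as prompt construction. This is a hypothetical illustration with an invented sentiment task; the helper names and format are not from the paper.

```python
# Hypothetical sketch: zero-shot vs. few-shot prompt construction for an
# illustrative sentiment task (not the paper's actual prompts).
def zero_shot_prompt(query: str) -> str:
    # Zero-shot: task description only, no examples.
    return ("Classify the sentiment as positive or negative.\n"
            f"Input: {query}\nSentiment:")

def few_shot_prompt(examples, query: str) -> str:
    # Few-shot: prepend a small number of input-output demonstrations.
    demos = "\n".join(f"Input: {x}\nSentiment: {y}" for x, y in examples)
    return f"{demos}\nInput: {query}\nSentiment:"

examples = [("great movie", "positive"), ("waste of time", "negative")]
print(few_shot_prompt(examples, "loved it"))
```

The model's weights are identical in both cases; only the text in the context window changes.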
Ultimatum Game: A game where a Proposer offers a split of resources and a Responder accepts or rejects it; often used to study fairness.
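The payoff rule of the Ultimatum Game fits in a few lines. This is a minimal sketch with a hypothetical helper, not the paper's environment implementation:

```python
# Ultimatum Game payoffs (hypothetical helper): the Proposer offers
# `offer` out of `total`; if the Responder rejects, both get nothing.
def ultimatum_payoffs(total: int, offer: int, accept: bool):
    if accept:
        return (total - offer, offer)  # (Proposer payoff, Responder payoff)
    return (0, 0)  # rejection destroys the whole pie

assert ultimatum_payoffs(10, 4, True) == (6, 4)
assert ultimatum_payoffs(10, 1, False) == (0, 0)
```

A purely payoff-maximizing Responder accepts any positive offer, which is why observed rejections of "unfair" splits make the game a standard probe of fairness norms.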
Pareto-optimality: A state where no individual's situation can be improved without making another individual's situation worse.
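Over a finite set of payoff profiles, Pareto-optimality can be checked directly. A hypothetical illustration (function names invented here):

```python
# Pareto-optimality over a finite set of payoff profiles, one utility
# per individual (hypothetical illustration).
def dominates(a, b):
    # a Pareto-dominates b: no one is worse off, someone is strictly better off.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_optimal(profiles):
    # Keep the profiles that no other profile dominates.
    return [p for p in profiles if not any(dominates(q, p) for q in profiles)]

print(pareto_optimal([(3, 3), (4, 2), (2, 2)]))  # → [(3, 3), (4, 2)]
```

Note that (3, 3) and (4, 2) are both Pareto-optimal even though they split welfare differently: moving between them always makes one individual worse off.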
RLHF: Reinforcement Learning from Human Feedback—a method to fine-tune models using rewards learned from human preference data.
DQN: Deep Q-Network—a value-based reinforcement learning algorithm that uses deep neural networks to estimate Q-values.
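The quantity a DQN regresses its Q-network toward is the one-step TD target. In this sketch a NumPy array stands in for the (target) network's Q-value outputs; the function name is invented for illustration:

```python
import numpy as np

# One-step TD target used by DQN (sketch; an array stands in for the
# target network's Q-values at the next state).
def td_target(reward: float, gamma: float, next_q_values: np.ndarray, done: bool) -> float:
    # y = r                          if the episode ended
    # y = r + gamma * max_a Q(s',a)  otherwise
    return reward if done else reward + gamma * float(next_q_values.max())

print(td_target(1.0, 0.99, np.array([0.5, 2.0, 1.0]), done=False))  # → 2.98
```

The network's parameters are then updated to reduce the squared error between its prediction Q(s, a) and this target.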
Parsers: Functions defined in this paper to convert environment states to text strings (input to LLM) and LLM text outputs to integers (reward for RL).
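The two parser roles described above can be sketched as a pair of small functions. The state format, question wording, and 0/1 reward scheme here are illustrative assumptions, not the paper's exact definitions:

```python
import re

# Hypothetical sketch of the two parser roles: environment state -> text
# for the LLM, and LLM text output -> integer reward for the RL agent.
def state_to_text(total: int, offer: int) -> str:
    # Render the environment state as a natural-language query.
    return (f"The Proposer offers {offer} out of {total} coins. "
            "Is this fair? Answer yes or no.")

def text_to_reward(llm_output: str) -> int:
    # Map the LLM's free-text answer to an integer reward (assumed 0/1 here).
    return 1 if re.search(r"\byes\b", llm_output.lower()) else 0

print(text_to_reward("Yes, that seems fair."))  # → 1
```

Together the two functions close the loop: the RL agent never sees raw text, and the LLM never sees raw environment states.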