Sampler: The policy/agent trained alongside the reward model during the reward learning process to generate trajectory segments for labeling
Relearner: A new, randomly initialized policy trained from scratch using the *frozen* learned reward function to test its robustness
Bradley-Terry model: A probabilistic model of pairwise comparisons in which the probability that one item is preferred over the other is a logistic function of the difference in their underlying rewards
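In the usual preference-based formulation (the notation here is assumed, not the document's own), the Bradley-Terry probability that segment $\sigma^1$ is preferred to segment $\sigma^2$ under a learned reward $\hat{r}$ is

$$
P\left(\sigma^1 \succ \sigma^2\right) = \frac{\exp\left(\sum_{t} \hat{r}(s_t^1, a_t^1)\right)}{\exp\left(\sum_{t} \hat{r}(s_t^1, a_t^1)\right) + \exp\left(\sum_{t} \hat{r}(s_t^2, a_t^2)\right)}
$$

i.e. a logistic function of the difference in summed rewards; the reward model is fit by minimizing the cross-entropy between these probabilities and the human labels.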
Reward Hacking: When an RL agent exploits errors or loopholes in a misspecified reward function to get high rewards without performing the intended task
Soft Actor-Critic (SAC): An off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework
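For reference, the maximum-entropy objective SAC optimizes augments the expected return with a policy-entropy bonus (this is the standard formulation from Haarnoja et al., with $\alpha$ a temperature trading off reward against entropy):

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$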
RL Budget: The total number of environment interactions (timesteps) the sampler agent is allowed during the reward learning phase
EPIC distance: A metric measuring the difference between two reward functions by canonicalizing them (making the comparison invariant to potential-based shaping and positive rescaling) and computing the Pearson distance between the canonicalized rewards over a coverage distribution (up to a constant factor, an L2 norm of the normalized canonical rewards)
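A minimal sketch of the EPIC computation on sampled transitions, following the Pearson-distance formulation of Gleave et al. (2021). It assumes scalar state/action arrays and vectorized reward functions; the helper names (`canonicalize`, `pearson_distance`, `epic_distance`) are hypothetical:

```python
import numpy as np

def pearson_distance(x, y):
    """EPIC's base metric: sqrt((1 - rho) / 2), bounded in [0, 1]."""
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))

def canonicalize(reward_fn, s, a, s_next, cov_s, cov_a, gamma=0.99):
    """Canonically shaped reward:
    C(R)(s,a,s') = R(s,a,s')
                   + E[gamma*R(s',A,S') - R(s,A,S') - gamma*R(S,A,S')],
    with S, A, S' drawn from the coverage distribution. Potential-based
    shaping terms cancel, so shaped rewards map to the same canonical point.
    """
    def mean_from(states):
        # Monte Carlo estimate of E_{A,S'}[R(states, A, S')].
        return np.mean(
            [reward_fn(states, np.full(len(states), ca), np.full(len(states), cs))
             for ca, cs in zip(cov_a, cov_s)],
            axis=0,
        )
    return (reward_fn(s, a, s_next)
            + gamma * mean_from(s_next)           # E[gamma * R(s', A, S')]
            - mean_from(s)                        # E[R(s, A, S')]
            - gamma * np.mean(mean_from(cov_s)))  # gamma * E[R(S, A, S')]

def epic_distance(r1, r2, s, a, s_next, cov_s, cov_a, gamma=0.99):
    """Pearson distance between the two canonicalized rewards."""
    c1 = canonicalize(r1, s, a, s_next, cov_s, cov_a, gamma)
    c2 = canonicalize(r2, s, a, s_next, cov_s, cov_a, gamma)
    return pearson_distance(c1, c2)

# Sanity check: a reward and a potential-shaped version are ~0 apart.
rng = np.random.default_rng(0)
s, a, s_next = rng.normal(size=(3, 256))
cov_s, cov_a = rng.normal(size=(2, 64))
base = lambda s, a, sn: s * a
shaped = lambda s, a, sn: base(s, a, sn) + 0.99 * np.sin(sn) - np.sin(s)
print(epic_distance(base, shaped, s, a, s_next, cov_s, cov_a))  # ~0.0
```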
Reward Ensemble: Training multiple reward models on bootstrap resamples of the preference data and using the mean of their outputs as the reward signal, which reduces variance in the learned reward (and lets disagreement across members serve as an uncertainty estimate)
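A minimal sketch of the ensemble idea, assuming PyTorch; the `RewardEnsemble` class, network sizes, and bootstrapping scheme are illustrative, not the specific setup used here:

```python
import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    """K reward networks; their mean output is the reward the agent sees."""
    def __init__(self, obs_dim, act_dim, k=5, hidden=64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(k)
        )

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        outs = torch.stack([m(x).squeeze(-1) for m in self.members])  # (k, batch)
        return outs.mean(dim=0)  # ensemble-mean reward signal

# Each member trains on its own bootstrap resample of the labeled pairs,
# e.g.: idx = torch.randint(n_pairs, (n_pairs,)); batch = pairs[idx]
```

Because each member sees a different resample of the preference data, the mean smooths out idiosyncratic errors of any single reward model.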