RLHF: Reinforcement Learning from Human Feedback—training agents using reward signals derived from human feedback rather than a pre-defined reward function
Bradley-Terry model: A statistical model used to predict the probability that one item is preferred over another, commonly used to train reward models from comparison data
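The Bradley-Terry preference probability can be sketched in a few lines; here the "items" are two trajectory segments scored by summed rewards (the variable names are illustrative, not from the source):

```python
import math

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B,
    given scalar scores r_a and r_b (e.g. summed segment rewards from
    a learned reward model). Equals sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))
```

Training a reward model then amounts to maximizing the log-likelihood of the observed comparison labels under this probability.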
IQL: Implicit Q-Learning—an offline RL algorithm that avoids querying out-of-distribution actions by estimating the value function with expectile regression over in-dataset actions, casting value estimation as a supervised learning problem
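A minimal sketch of the asymmetric (expectile) loss at the heart of IQL, for a single temporal-difference residual; the function name and default tau are illustrative:

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """IQL's expectile regression loss on diff = Q(s, a) - V(s).
    Positive residuals are weighted by tau, negative ones by 1 - tau,
    so for tau > 0.5 the value estimate skews toward an upper expectile
    of Q over dataset actions, without ever evaluating unseen actions."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2
```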
CQL: Conservative Q-Learning—an offline RL algorithm that learns a conservative lower bound on the value function to prevent overestimation
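The conservative term in CQL can be sketched as a log-sum-exp penalty over a discrete action set (a simplification of the full objective, with illustrative names):

```python
import math

def cql_penalty(q_all_actions: list[float], q_data_action: float) -> float:
    """Simplified CQL regularizer for discrete actions:
    logsumexp over Q-values of all actions minus the Q-value of the
    action observed in the dataset. Minimizing this pushes Q-values
    down on unseen actions and up on in-dataset actions, yielding a
    conservative (lower-bound) value estimate."""
    lse = math.log(sum(math.exp(q) for q in q_all_actions))
    return lse - q_data_action
```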
TD3BC: TD3 with Behavior Cloning—a minimalist offline RL algorithm that adds a behavior cloning regularization term to the standard TD3 objective
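A per-sample sketch of the TD3+BC actor objective; in the original formulation the normalizer uses the mean absolute Q-value over a batch, so the single-sample version below is an illustrative simplification:

```python
def td3bc_actor_loss(q_value: float,
                     policy_action: list[float],
                     data_action: list[float],
                     alpha: float = 2.5) -> float:
    """TD3+BC actor loss (to minimize): -lambda * Q(s, pi(s)) + MSE(pi(s), a),
    where lambda = alpha / |Q| balances the RL term against the behavior
    cloning term. alpha = 2.5 is the value used in the TD3+BC paper."""
    lam = alpha / (abs(q_value) + 1e-8)
    mse = sum((p - a) ** 2
              for p, a in zip(policy_action, data_action)) / len(policy_action)
    return -lam * q_value + mse
```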
Oracle: A model trained using the ground-truth, hand-engineered reward function provided by the environment simulator
ST: Scripted Teacher—synthetic feedback generated by a programmed labeler that assigns preferences exactly according to the ground-truth reward function
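A scripted teacher can be sketched as a labeler that compares ground-truth segment returns; the function name and tie convention here are illustrative assumptions:

```python
def scripted_label(rewards_a: list[float], rewards_b: list[float]) -> float:
    """Scripted-teacher preference label: 1 if segment A's ground-truth
    return exceeds B's, 0 if B's is higher, 0.5 for a tie. Unlike human
    annotators, this labeler is noise-free by construction."""
    ret_a, ret_b = sum(rewards_a), sum(rewards_b)
    if ret_a > ret_b:
        return 1.0
    if ret_b > ret_a:
        return 0.0
    return 0.5
```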
CS: Crowd-Sourced—feedback labels collected from real human workers via the Uni-RLHF platform
ex-ante filters: Quality control mechanisms applied *during* or *before* data collection (like qualifying exams or real-time validation) to filter out bad annotators
saliency map: A heatmap representation highlighting which parts of an image observation are most important for decision making, used in visual feedback