PPO: Proximal Policy Optimization—an on-policy RL algorithm used here for the initial privileged teacher policy
DDPG: Deep Deterministic Policy Gradient—an off-policy RL algorithm used here for the visual policy to efficiently reuse data
CaT: Constraints as Terminations—a method to enforce safety constraints by terminating the episode when a constraint is violated, so that violations are treated as terminal states
RLPD: Reinforcement Learning with Prior Data—a technique to accelerate RL by filling the replay buffer with demonstrations from a prior controller
privileged information: Data available only in simulation (e.g., exact terrain height maps, obstacle positions) used to train a teacher policy but unavailable to the real robot
distillation: A process where a 'student' neural network learns to mimic the output of a 'teacher' network; often used to transfer privileged behaviors to vision-based agents
observability gap: The discrepancy between what a privileged teacher knows (the full simulator state) and what a visual student can see (limited field of view, occlusions), making perfect imitation impossible
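PD controller: Proportional-Derivative controller—a feedback controller whose output is a weighted sum of the tracking error and its rate of change; in robotics it is commonly used to turn target joint positions into motor torques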
REDQ: Randomized Ensembled Double Q-learning—an RL technique using an ensemble of critics to reduce overestimation bias, enabling high update-to-data ratios
sim-to-real: The process of transferring a policy trained in a physics simulator to a physical robot
warm-start: Initializing the training process with pre-collected data or pre-trained weights to speed up learning
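
As an illustration of the PD controller entry above, the control law can be sketched in a few lines of Python. This is a minimal sketch, not the controller used in the work this glossary accompanies: the function name `pd_torque`, its parameter names, and the gain values in the usage note are all assumptions chosen for illustration.

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp, kd, q_dot_target=0.0):
    """Hypothetical helper: PD control law for joint torques.

    torque = kp * (position error) + kd * (velocity error)
    Inputs may be scalars or per-joint arrays; NumPy broadcasting applies.
    """
    q_target = np.asarray(q_target, dtype=float)
    q = np.asarray(q, dtype=float)
    q_dot = np.asarray(q_dot, dtype=float)
    position_error = q_target - q          # proportional term input
    velocity_error = q_dot_target - q_dot  # derivative term input (damping)
    return kp * position_error + kd * velocity_error
```

For example, with assumed gains `kp=20.0` and `kd=0.5`, a joint at 0.5 rad moving at 0.2 rad/s toward a 1.0 rad target gives `pd_torque([1.0], [0.5], [0.2], kp=20.0, kd=0.5)`, which evaluates to 20.0 * 0.5 + 0.5 * (0 - 0.2) = 9.9. The derivative term opposes the current velocity, which is what damps the motion as the joint approaches its target.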