SAC: Soft Actor-Critic—an off-policy RL algorithm that maximizes both expected reward and policy entropy for better exploration
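SAC's maximum-entropy objective can be illustrated with a tiny sketch. The function name and the scalar `alpha` (the entropy temperature) are illustrative, not from the source; the point is only that each reward is augmented with an entropy bonus `-alpha * log pi(a|s)`.

```python
import numpy as np

def entropy_regularized_return(rewards, log_probs, alpha=0.2):
    """Soft (maximum-entropy) return: each step's reward gets an
    entropy bonus -alpha * log pi(a|s), as in SAC's objective.
    Names and alpha value here are illustrative."""
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.sum(rewards - alpha * log_probs))
```

Low-probability actions (very negative `log_probs`) receive a larger bonus, which is what drives the extra exploration.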
Replay Ratio (RR): The number of gradient update steps taken by the network for every single step taken in the environment
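A minimal sketch of what the replay ratio controls, using a stub agent and a plain list as a stand-in buffer (all names here are hypothetical): for every transition collected from the environment, the agent takes `replay_ratio` gradient steps.

```python
class CountingAgent:
    """Stub agent that only counts gradient updates (illustration)."""
    def __init__(self):
        self.updates = 0

    def update(self, batch):
        self.updates += 1

def collect_and_train(env_steps, replay_ratio, agent, buffer):
    """For each environment step collected, run `replay_ratio`
    gradient updates on replayed data."""
    for _ in range(env_steps):
        buffer.append(("obs", "action", "reward"))  # one env transition
        for _ in range(replay_ratio):               # RR updates per step
            agent.update(buffer[-1])
    return agent.updates
```

With `env_steps=10` and `replay_ratio=4`, the agent performs 40 gradient updates; raising RR increases sample efficiency but stresses plasticity.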
Plasticity: The ability of a neural network to keep learning and adapting its weights throughout training; plasticity loss means the network gradually loses this ability and can no longer fit new targets
CDQ: Clipped Double Q-learning—a technique using two critic networks and taking the minimum of their outputs to prevent overestimating values
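The CDQ target can be written in a few lines. This is a generic sketch of the min-of-two-critics TD target, not any particular codebase's implementation; `gamma` is the discount factor.

```python
import numpy as np

def cdq_target(reward, next_q1, next_q2, done, gamma=0.99):
    """Clipped Double Q-learning TD target: take the elementwise
    minimum of the two critics' next-state value estimates."""
    next_q = np.minimum(next_q1, next_q2)
    return reward + gamma * (1.0 - done) * next_q
```

Because the minimum is used, a single overestimating critic cannot inflate the target on its own.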
Layer Norm (LN): Layer Normalization—a technique that normalizes the inputs to a layer across the feature dimension to stabilize training
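Layer Norm in one function, as a sketch: normalize each input across its feature (last) dimension, then apply a scale and shift (scalars here for simplicity; in practice `gamma` and `beta` are learned per-feature vectors).

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row of `x` across the feature (last) dimension,
    then apply a scale/shift. Scalar gamma/beta kept for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```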
Spectral Norm (SN): Spectral Normalization—a technique that constrains a layer's Lipschitz constant by dividing its weight matrix by an estimate of its largest singular value, originally popularized for stabilizing GAN discriminators
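A sketch of the core of spectral normalization, assuming the standard power-iteration estimate of the top singular value (practical implementations carry `u` across training steps and run only one iteration per update):

```python
import numpy as np

def spectral_normalize(W, n_iters=50, seed=0):
    """Estimate W's largest singular value by power iteration and
    divide W by it, bounding the layer's Lipschitz constant near 1."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # Rayleigh-quotient estimate of sigma_max
    return W / sigma
```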
GPL: Generalized Pessimism Learning—an RL method that adjusts the level of pessimism in Q-value updates based on estimated error
TOP: Tactical Optimism and Pessimism—an RL method that switches between optimistic and pessimistic updates
Resets: Periodically resetting the weights of the last few layers of the actor/critic networks to restore plasticity
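A minimal sketch of the reset trick, representing a network as a list of weight matrices (hypothetical setup, not any specific agent's code): the final `n_last` matrices are freshly re-initialized while earlier layers keep their trained values.

```python
import numpy as np

def reset_last_layers(weights, n_last=1, seed=0):
    """Return a copy of `weights` with the final `n_last` matrices
    re-initialized (scaled Gaussian init); earlier layers are kept."""
    rng = np.random.default_rng(seed)
    out = [w.copy() for w in weights]
    for i in range(len(weights) - n_last, len(weights)):
        fan_in = weights[i].shape[0]
        out[i] = rng.normal(0.0, 1.0 / np.sqrt(fan_in),
                            size=weights[i].shape)
    return out
```

In practice this is applied every fixed number of gradient steps, and the replay buffer is kept so the reset layers can quickly relearn from stored data.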
DMC: DeepMind Control Suite—a physics-based simulation benchmark for RL agents
MW: MetaWorld—a multi-task robotics benchmark for manipulation