SAC: Soft Actor-Critic—an off-policy RL algorithm whose objective augments expected return with a policy-entropy (randomness) bonus, trading off exploitation against exploration
LayerNorm: Layer Normalization—a technique that normalizes a layer's inputs across the feature dimension, used here to bound the output magnitude of the Q-function
UTD: Update-To-Data ratio—the number of gradient updates performed for every single step taken in the environment
Symmetric Sampling: A data loading strategy where each training batch consists of exactly 50% samples from the offline dataset and 50% from the online replay buffer
OOD: Out-of-Distribution—states or actions not present in the training dataset, often leading to erroneous value estimates in RL
Q-function: A 'critic' network that estimates the expected cumulative (discounted) future reward of taking a specific action in a specific state
Bellman backup: The update rule in RL that brings the current value estimate closer to the reward plus the discounted value of the next state
IQL: Implicit Q-Learning—a prior offline RL method that avoids querying values of unseen actions to remain conservative
Ensemble: Using multiple neural networks (critics) to estimate the same value, helping to reduce variance and estimate uncertainty
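Several of these terms compose into one training loop. The following is a minimal NumPy sketch of that composition, not the source's implementation: function names such as `symmetric_batch` and `bellman_target` are illustrative, and the 1-D "transitions" and fixed Q-values are toy stand-ins for real replay data and critic networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "transitions": negative values stand in for the offline dataset,
# positive values for the online replay buffer, so provenance stays visible.
offline_data = -np.arange(1, 101, dtype=float)
online_buffer = np.arange(1, 51, dtype=float)

def symmetric_batch(offline, online, batch_size, rng):
    """Symmetric sampling: 50% from the offline dataset, 50% from the
    online replay buffer."""
    half = batch_size // 2
    off = offline[rng.integers(len(offline), size=half)]
    on = online[rng.integers(len(online), size=batch_size - half)]
    return np.concatenate([off, on])

def bellman_target(reward, gamma, next_q_ensemble):
    """Bellman backup with an ensemble: reward plus the discounted minimum
    over the critics' next-state estimates (the min guards against
    overestimation on OOD actions)."""
    return reward + gamma * np.min(next_q_ensemble, axis=0)

# UTD ratio: perform `utd` gradient updates per environment step.
utd = 4
updates = 0
for env_step in range(10):          # 10 environment steps
    batch = symmetric_batch(offline_data, online_buffer, 256, rng)
    for _ in range(utd):            # UTD gradient updates on that data
        target = bellman_target(reward=1.0, gamma=0.99,
                                next_q_ensemble=np.array([2.0, 1.5]))
        updates += 1                # stand-in for a critic gradient step
print(updates)                      # 10 env steps x UTD 4 = 40 updates
```

The ensemble minimum here plays the conservative role that IQL achieves differently (by never querying unseen actions at all).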