Bootstrapping: Estimating the value of a state from the estimated value of its successor state (standard in Q-learning); a known source of instability.
Bellman Operator: The recursive update rule used to train value functions: Q(s,a) = r(s,a) + γ E[V(s')], where V(s') = max_{a'} Q(s',a').
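As a minimal illustration of the operator above, here is a tabular sketch assuming a finite MDP with a known reward table and transition matrix (the function name `bellman_backup` and the array shapes are illustrative, not from the paper):

```python
import numpy as np

def bellman_backup(Q, rewards, transitions, gamma=0.99):
    """One application of the Bellman optimality operator on a tabular Q.

    Q:           (S, A) current value estimates
    rewards:     (S, A) reward table r(s, a)
    transitions: (S, A, S) transition probabilities P(s' | s, a)
    Returns r(s,a) + gamma * E_{s'}[ max_a' Q(s', a') ].
    """
    V = Q.max(axis=1)                        # V(s') = max_a' Q(s', a')
    return rewards + gamma * transitions @ V  # expectation over s'

# Toy 2-state, 2-action MDP: with Q = 0 the backup just returns r(s, a).
Q0 = np.zeros((2, 2))
r = np.ones((2, 2))
P = np.full((2, 2, 2), 0.5)
Q1 = bellman_backup(Q0, r, P)
```

Iterating this backup to a fixed point is value iteration; deep Q-learning approximates the same update with sampled transitions and a function approximator, which is where the instability mentioned under "Bootstrapping" enters.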
Heuristic: In this paper, a value estimate derived from domain knowledge or data, specifically Monte-Carlo returns calculated from the offline dataset.
Monte-Carlo Return: The actual sum of discounted rewards observed in a trajectory from the dataset.
Deadly Triad: The instability caused by combining off-policy learning, bootstrapping, and function approximation.
SoTA: State-of-the-art.
CQL: Conservative Q-Learning—an offline RL algorithm that penalizes Q-values for actions outside the dataset.
IQL: Implicit Q-Learning—an offline RL algorithm that avoids querying out-of-sample actions during value updates.