Offline RL: Reinforcement learning that learns a policy solely from a fixed dataset without further environment interaction
Epistemic Uncertainty: Uncertainty arising from lack of data/knowledge, as opposed to inherent stochasticity (aleatoric)
Robust MDP: An MDP formulation in which transition probabilities are chosen adversarially from an uncertainty set to minimize the agent's return, so the learned policy is optimized for the worst case
Transition Kernel: The function p(s'|s,a) defining the probability of moving to state s' given state s and action a
Uncertainty Set: A set of plausible transition kernels consistent with the data
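The three definitions above fit together in a single backup: for each state-action pair, the adversary picks the transition kernel from the uncertainty set that minimizes the backed-up value. A minimal sketch, using a hypothetical toy MDP with a finite uncertainty set of two candidate kernels (the function and table names are illustrative, not from the source):

```python
import numpy as np

def robust_bellman_backup(V, kernels, rewards, gamma=0.99):
    """Worst-case Bellman backup over a finite uncertainty set.

    V:       (S,) current state-value estimates
    kernels: list of (S, A, S) transition tensors (the uncertainty set)
    rewards: (S, A) reward table
    Returns the robust Q-table of shape (S, A).
    """
    # Backed-up Q under each candidate kernel: r(s, a) + gamma * E_p[V(s')]
    q_per_kernel = np.stack([rewards + gamma * p @ V for p in kernels])
    # The adversary minimizes over the uncertainty set
    return q_per_kernel.min(axis=0)

# Toy example: 2 states, 2 actions, two plausible transition kernels
rng = np.random.default_rng(0)
kernels = []
for _ in range(2):
    p = rng.random((2, 2, 2))
    p /= p.sum(axis=-1, keepdims=True)  # normalize rows to valid distributions
    kernels.append(p)
rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
V = np.zeros(2)
Q = robust_bellman_backup(V, kernels, rewards)
print(Q.shape)  # (2, 2)
```

With a KL-ball uncertainty set (as in the KL Divergence entry below), the inner minimization becomes a constrained optimization rather than a min over a finite list, but the structure of the backup is the same.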
KL Divergence: A measure of how one probability distribution differs from a second, reference probability distribution
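For discrete distributions the KL divergence is KL(p || q) = Σᵢ pᵢ log(pᵢ / qᵢ). A short illustrative implementation (the helper name is ours):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as arrays of probabilities.
    Asymmetric: KL(p || q) != KL(q || p) in general."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0 -- identical distributions
print(kl_divergence(p, q))  # positive, since p != q
```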
Bellman Operator: A function that updates value estimates based on the immediate reward and the estimated value of the next state
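Concretely, the policy-evaluation form of the operator is (TᵖⁱV)(s) = Rᵖⁱ(s) + γ Σₛ' Pᵖⁱ(s'|s) V(s'). A sketch on a hypothetical two-state chain (the numbers are illustrative):

```python
import numpy as np

def bellman_backup(V, P_pi, R_pi, gamma=0.99):
    """Policy-evaluation Bellman operator:
    (T^pi V)(s) = R^pi(s) + gamma * sum_{s'} P^pi(s'|s) V(s')."""
    return R_pi + gamma * P_pi @ V

# Two-state chain: transition matrix and rewards under a fixed policy
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
R_pi = np.array([1.0, 0.0])
V = np.zeros(2)
V_new = bellman_backup(V, P_pi, R_pi)
print(V_new)  # [1. 0.] -- the first update just reflects immediate rewards
```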
Contraction Mapping: A mapping that shrinks the distance between any two points by a factor less than one; by the Banach fixed-point theorem, repeatedly applying it converges to a unique fixed point
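The Bellman optimality operator is a γ-contraction in the sup-norm, which is why value iteration converges. A numerical sketch on a hypothetical random tabular MDP:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)  # valid transition kernel p(s'|s,a)
R = rng.random((S, A))

def T(V):
    """Bellman optimality operator: (T V)(s) = max_a [R(s,a) + gamma * E[V(s')]]."""
    return (R + gamma * P @ V).max(axis=1)

# gamma-contraction in the sup-norm: ||T V1 - T V2|| <= gamma * ||V1 - V2||
V1, V2 = rng.random(S), rng.random(S)
d_before = np.abs(V1 - V2).max()
d_after = np.abs(T(V1) - T(V2)).max()
print(d_after <= gamma * d_before)  # True

# Repeated application converges to the unique fixed point V* = T(V*)
V = np.zeros(S)
for _ in range(500):
    V = T(V)
print(np.abs(T(V) - V).max() < 1e-8)  # True
```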
Surrogate Objective: A substitute objective function that is easier to optimize but whose improvement guarantees improvement on the original objective
PMDB: Pessimism-Modulated Dynamics Belief—a model-based offline RL baseline that maintains a belief over plausible dynamics and modulates it pessimistically
RAMBO: Robust Adversarial Model-Based Offline RL—a baseline method that modifies dynamics to minimize value
MOReL: Model-Based Offline Reinforcement Learning—a baseline using a pessimism penalty based on uncertainty
CQL: Conservative Q-Learning—a model-free baseline that penalizes Q-values on out-of-distribution actions to learn a conservative value estimate
D4RL: Datasets for Deep Data-Driven Reinforcement Learning—a standard benchmark suite