Robust MDP: An MDP formulation where transition probabilities are chosen from an uncertainty set to minimize the agent's reward (worst-case optimization).
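The objective can be written as a max-min (notation assumed here, not fixed by this entry: $\mathcal{P}$ is the uncertainty set of transition kernels and $\rho^{\pi}_{P}$ is the performance of policy $\pi$ under kernel $P$):

```latex
\max_{\pi} \; \min_{P \in \mathcal{P}} \; \rho^{\pi}_{P}
```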
Average-Reward: A performance criterion maximizing the long-term average reward per time step, rather than a discounted sum of future rewards.
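Formally, assuming the limit exists, the criterion is:

```latex
\rho^{\pi}(s) \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \;\middle|\; s_0 = s \right]
```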
Relative Value Function: A function measuring the expected cumulative deviation of rewards from the long-term average reward when starting from a given state; it captures the transient advantage (or disadvantage) of that starting state.
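One common form, valid e.g. for aperiodic unichain policies with gain $\rho^{\pi}$ (in general a Cesàro limit is used):

```latex
h^{\pi}(s) \;=\; \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \bigl( r(s_t, a_t) - \rho^{\pi} \bigr) \;\middle|\; s_0 = s \right]
```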
RVI (Relative Value Iteration): An algorithm for average-reward MDPs that subtracts a reference value (offset) at each step to prevent value estimates from diverging to infinity.
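A minimal sketch of plain (non-robust, tabular) RVI under assumed conventions: `P` is an `(A, S, S)` transition tensor, `r` an `(S, A)` reward matrix, and the offset is the value of a fixed reference state.

```python
import numpy as np

def rvi(P, r, ref_state=0, tol=1e-8, max_iter=10_000):
    """Relative Value Iteration for a tabular average-reward MDP.

    P: (A, S, S) transition tensor; r: (S, A) reward matrix.
    Subtracting the reference state's value each sweep keeps the
    iterates bounded; the subtracted offset converges to the gain.
    """
    S, A = r.shape
    h = np.zeros(S)
    gain = 0.0
    for _ in range(max_iter):
        # One-step lookahead: Q[s, a] = r[s, a] + sum_s' P[a, s, s'] * h[s']
        Q = r + (P @ h).T          # (P @ h) has shape (A, S); transpose to (S, A)
        h_new = Q.max(axis=1)
        gain = h_new[ref_state]    # offset; converges to the optimal average reward
        h_new = h_new - gain       # the subtraction that prevents divergence
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return gain, h
```

Without the subtraction, the iterates grow roughly linearly (by the gain per sweep) and never converge.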
Unichain: A condition where every policy induces a Markov chain with a single recurrent class, ensuring the average reward is independent of the starting state.
Multi-level Monte-Carlo: A sampling technique used here to construct unbiased estimators for non-linear functions (like worst-case operators) that would otherwise be biased if estimated directly from samples.
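A sketch of the idea on a toy problem, not the paper's estimator: estimating $f(\mathbb{E}[X])$ for a nonlinear $f$. Plugging a sample mean into $f$ is biased; randomizing over levels and reweighting by the level probability makes the level differences telescope, so the estimate is unbiased in expectation. The geometric rate `p=0.6` is an assumed choice trading off variance against expected sample cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlmc_estimate(sample, f, p=0.6):
    """One unbiased MLMC estimate of f(E[X]) for a smooth nonlinear f.

    `sample(m)` draws m i.i.d. copies of X. The antithetic difference
    between one fine estimate and two coarse half-sample estimates is
    divided by the probability of the randomly drawn level.
    """
    n = int(rng.geometric(p)) - 1        # random level n = 0, 1, 2, ...
    prob_n = p * (1.0 - p) ** n          # P(level == n)
    xs = sample(2 ** (n + 1))
    # fine estimate minus average of the two half-sample (coarse) estimates
    delta = f(xs.mean()) - 0.5 * (f(xs[::2].mean()) + f(xs[1::2].mean()))
    return f(sample(1).mean()) + delta / prob_n
```

Averaging many such estimates targets $f(\mathbb{E}[X])$ itself, whereas the naive plug-in targets $\mathbb{E}[f(\bar{X})]$.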
Bellman Operator: An operator that updates a value function by relating the value of a state to the immediate reward plus the expected value of the successor state; in robust RL, this expectation is taken under the worst-case kernel, i.e. the update involves a minimization over the uncertainty set.
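A minimal sketch of one robust backup for a finite uncertainty set of kernels, shown in the discounted setting for simplicity (the average-reward version would replace discounting with the offset used by RVI); the shapes and the finite-set structure are assumptions for illustration.

```python
import numpy as np

def robust_bellman(h, kernels, r, gamma=0.95):
    """One application of a robust Bellman operator (discounted, tabular).

    kernels: list of (A, S, S) transition tensors forming a finite
    (s, a)-rectangular uncertainty set; the adversary picks, per
    state-action pair, the kernel minimizing the backed-up value.
    """
    # next_vals[k, a, s] = expected value of h under kernel k, action a, state s
    next_vals = np.stack([P @ h for P in kernels])  # (K, A, S)
    worst = next_vals.min(axis=0)                   # adversarial choice per (a, s)
    Q = r + gamma * worst.T                         # (S, A)
    return Q.max(axis=1)                            # agent is greedy over actions
```

Because the inner minimization is over a compact set, the operator remains a contraction and iterating it converges to the robust value function.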