weakly communicating MDP: An MDP whose states can be partitioned into a communicating set, in which every state is reachable from every other state under some stationary policy, and a (possibly empty) set of states that are transient under every policy.
sp(h*): The span of the optimal bias function h*, defined as sp(h*) = max_s h*(s) - min_s h*(s). It measures the maximum difference in relative value between any two states.
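The span is just the max-minus-min semi-norm of the bias vector. A minimal sketch, using a hypothetical bias vector for a 4-state MDP (the values are illustrative, not from any particular MDP):

```python
# Hypothetical bias vector h* for a 4-state MDP (illustrative values only).
h_star = [2.0, 0.5, -1.0, 1.5]

def span(h):
    """Span semi-norm: sp(h) = max(h) - min(h)."""
    return max(h) - min(h)

print(span(h_star))  # 3.0
```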
regret: The gap between the total reward the optimal policy would have collected and the total reward the algorithm's policy actually collects over T steps.
reference-advantage decomposition: A technique where the target value is split into a fixed 'reference' part (estimated with low variance) and a residual 'advantage' part.
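The decomposition is an identity: the bootstrapped target is unchanged, but its two parts can be estimated from different numbers of samples. A minimal sketch with hypothetical numbers (V_ref and V are made up for illustration):

```python
# The bootstrapped target V(s') is rewritten as a fixed reference V_ref(s')
# plus a residual advantage (V(s') - V_ref(s')). Hypothetical numbers:
V_ref = 4.8  # low-variance 'reference' estimate, frozen once accurate enough
V = 5.0      # current value estimate at the next state s'

advantage = V - V_ref        # small residual 'advantage' part
target = V_ref + advantage   # algebraically identical to V, but the two
                             # parts can be averaged over different sample
                             # sets, lowering the variance of the update
print(target)  # 5.0
```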
value-difference estimation: Estimating the gap V(s) - V(s') directly, rather than estimating V(s) and V(s') separately, to tighten confidence intervals.
diameter D: The maximum, over state pairs (s, s'), of the minimum over stationary policies of the expected time to travel from s to s'. Intuitively, any state can be reached from any other within D expected steps under the best policy for that pair.
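For a small MDP the diameter can be computed directly: for each target state g, the minimum expected hitting time h(s) satisfies h(g) = 0 and h(s) = 1 + min_a sum_{s'} T[s, a, s'] h(s'), and D is the largest h(s) over all pairs. A minimal sketch, assuming the hypothetical 3-state, 2-action transition tensor below:

```python
import numpy as np

# Hypothetical transition tensor T[s, a, s'] (illustrative numbers only).
T = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.0, 0.9]],   # transitions out of state 0
    [[0.0, 0.8, 0.2], [0.7, 0.3, 0.0]],   # transitions out of state 1
    [[0.5, 0.0, 0.5], [0.0, 0.6, 0.4]],   # transitions out of state 2
])

def diameter(T, iters=10_000, tol=1e-10):
    n = T.shape[0]
    D = 0.0
    for g in range(n):                     # treat each state as the target
        h = np.zeros(n)                    # min expected hitting times to g
        for _ in range(iters):
            # Bellman backup: one step plus best-action expected remainder.
            h_new = 1.0 + (T @ h).min(axis=1)
            h_new[g] = 0.0                 # already at the target
            if np.abs(h_new - h).max() < tol:
                break
            h = h_new
        D = max(D, h.max())
    return D

print(round(diameter(T), 2))  # 2.78
```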
mixing time: The number of steps after which a Markov chain's state distribution is within a given distance of its stationary distribution (conventionally total-variation distance 1/4), regardless of the starting state.
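The total-variation mixing time can be computed by powering the transition matrix. A minimal sketch for a hypothetical two-state chain (the matrix P and threshold 1/4 are conventional choices, not from the text above):

```python
import numpy as np

# Hypothetical two-state chain; rows are the current state.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Its stationary distribution solves pi @ P = pi; here pi = (2/3, 1/3).
pi = np.array([2 / 3, 1 / 3])

def mixing_time(P, pi, eps=0.25):
    """Smallest t with max_s TV(P^t(s, .), pi) <= eps."""
    Pt = np.eye(len(pi))                       # P^0 = identity
    t = 0
    while True:
        # Worst-case total-variation distance over starting states.
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if tv <= eps:
            return t
        Pt = Pt @ P
        t += 1

print(mixing_time(P, pi))  # 3
```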