TD learning: Temporal-Difference learning—a method to estimate value functions by bootstrapping from current estimates.
Semi-gradient: An update rule that treats the target value as a fixed constant, ignoring its dependence on the parameters being optimized.
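A minimal sketch of a semi-gradient TD(0) update with linear value features v(s) = w·x(s) (function name and feature vectors are illustrative): the target r + γ w·x(s') is held fixed, so the update differentiates only through v(s).

```python
import numpy as np

def semi_gradient_td0(w, x, r, x_next, gamma=0.9, alpha=0.1):
    # Target is treated as a constant: no gradient flows through w here.
    target = r + gamma * np.dot(w, x_next)
    delta = target - np.dot(w, x)          # TD error
    return w + alpha * delta * x           # gradient only through v(s)

w = np.zeros(2)
w = semi_gradient_td0(w, x=np.array([1.0, 0.0]), r=1.0,
                      x_next=np.array([0.0, 1.0]))
```

Ignoring the target's dependence on w is what makes this a "semi" gradient: the update is not the gradient of any fixed objective.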
Gradient TD: A family of algorithms that minimize the Bellman error (or Projected Bellman Error, PBE) via true stochastic gradient descent, using a secondary estimator to avoid the 'double sampling' problem.
Bellman Error (BE): The difference between a value function and its Bellman backup: BE(Q) = ||TQ - Q||, where T denotes the Bellman operator.
TDRC: TD with Regularized Corrections—a Gradient TD method that augments the TD update with a regularized secondary weight vector estimating the expected TD error, correcting the semi-gradient bias.
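One common form of this update with linear features is sketched below (a hedged illustration following the TDC-style correction plus an l2-regularized secondary weight h; the step sizes and regularization coefficient beta are illustrative, not canonical):

```python
import numpy as np

def tdrc_update(w, h, x, r, x_next, gamma=0.9, alpha=0.1, beta=1.0):
    # TD error under the current primary weights w.
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)
    # Primary update: TD step plus a gradient-correction term using h.
    w_new = w + alpha * delta * x - alpha * gamma * np.dot(h, x) * x_next
    # Secondary update: h tracks the expected TD error, pulled toward
    # zero by the l2 regularizer (the "regularized correction").
    h_new = h + alpha * (delta - np.dot(h, x)) * x - alpha * beta * h
    return w_new, h_new

w, h = np.zeros(2), np.zeros(2)
w, h = tdrc_update(w, h, x=np.array([1.0, 0.0]), r=1.0,
                   x_next=np.array([0.0, 1.0]))
```

The regularizer keeps h small when the correction is unnecessary, so the method behaves like plain TD in easy settings while retaining gradient-TD stability guarantees.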
Iterated TD (i-TD): Learning a sequence of value functions Q_k where Q_k approximates the Bellman update of Q_{k-1}.
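A toy sketch of this iteration on a 2-state MRP (transition matrix, rewards, and iteration count are made up for illustration): each V_k is fit to the Bellman backup of the frozen previous iterate V_{k-1}. With tabular values the fit is exact, so the loop reduces to value iteration.

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.0, 1.0]])      # transitions; state 1 is absorbing
r = np.array([1.0, 0.0])        # expected reward per state
gamma = 0.9

V = np.zeros(2)
for k in range(200):
    V = r + gamma * P @ V       # V_k approximates the backup of V_{k-1}
```

Because each target V_{k-1} is frozen while V_k is learned, every stage is an ordinary regression problem rather than a moving-target problem.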
Double Sampling Problem: The issue where an unbiased estimate of the square of the expected Bellman error requires two independent next-state samples from the same state-action pair.
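A quick Monte Carlo illustration with a toy Gaussian "TD error" (the distribution is made up for demonstration): squaring a single sample of delta overestimates (E[delta])^2 by Var(delta), while the product of two independent samples is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independent draws of the same random TD error, mean 0.5, std 1.0:
delta1 = rng.normal(loc=0.5, scale=1.0, size=1_000_000)
delta2 = rng.normal(loc=0.5, scale=1.0, size=1_000_000)

single = np.mean(delta1 ** 2)      # biased: approaches 0.5**2 + 1.0 = 1.25
double = np.mean(delta1 * delta2)  # unbiased: approaches 0.5**2 = 0.25
```

In an environment, the two draws correspond to two independent next states from the same (s, a), which a single trajectory cannot provide.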
Target Network: A copy of the value network frozen for a period to stabilize learning targets in Deep RL.
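The target-network bookkeeping can be sketched as follows (class and parameter names are illustrative, with a hard sync every `sync_every` updates):

```python
import numpy as np

class TargetNetwork:
    def __init__(self, weights, sync_every=2):
        self.online = weights.copy()   # weights being trained
        self.target = weights.copy()   # frozen copy used for targets
        self.sync_every = sync_every
        self.steps = 0

    def update(self, grad, lr=0.1):
        self.online -= lr * grad       # only the online weights learn
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = self.online.copy()   # periodic hard sync

tn = TargetNetwork(np.ones(2), sync_every=2)
tn.update(np.ones(2))   # target still frozen at the initial weights
tn.update(np.ones(2))   # sync point: target now matches online
```

A common variant replaces the hard sync with a slow exponential moving average of the online weights (a "soft" update).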