Dual RL: A framework that solves the unconstrained dual of the primal RL problem of optimizing the state-action visitation distribution under linear Bellman flow constraints.
ReCOIL: RElaxed Coverage for Off-policy Imitation Learning—a discriminator-free imitation method that matches the agent's visitation distribution to a mixture of expert and suboptimal data distributions, relaxing the coverage assumption.
f-DVL: f-Dual V Learning—a family of offline RL algorithms using stable f-divergence surrogates for implicit value maximization.
XQL: Extreme Q-Learning—a prior method utilizing Gumbel regression for value learning, which the paper identifies as a specific instance of Dual RL with reverse-KL divergence.
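As a minimal sketch (illustrative, not taken from the paper's code), the Gumbel regression loss L(v) = E[exp((x − v)/β) − (x − v)/β − 1] has the closed-form minimizer v* = β · log E[exp(x/β)], a soft maximum that interpolates between the mean (large β) and the max (small β); the variable names below are assumptions:

```python
import math

def gumbel_regression_value(xs, beta):
    """Closed-form minimizer of the Gumbel regression loss
    L(v) = E[exp((x - v)/beta) - (x - v)/beta - 1],
    namely v* = beta * log E[exp(x / beta)] (a soft maximum)."""
    m = max(xs)  # subtract the max before exponentiating for numerical stability
    return m + beta * math.log(sum(math.exp((x - m) / beta) for x in xs) / len(xs))

samples = [1.0, 2.0, 3.0, 10.0]
# small beta -> v* approaches max(samples); large beta -> v* approaches the mean (4.0)
v_sharp = gumbel_regression_value(samples, 0.01)
v_soft = gumbel_regression_value(samples, 100.0)
```

This is why Gumbel regression serves as an implicit maximizer: tuning β pushes the learned value toward the maximum over the data without an explicit max over actions.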
f-divergence: A measure of the discrepancy between two probability distributions; examples include the KL divergence and the Pearson χ² divergence.
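Concretely (an illustrative sketch, not from the paper), an f-divergence has the form D_f(P‖Q) = E_{x∼Q}[f(p(x)/q(x))] for a convex generator f with f(1) = 0; the generator f(t) = t log t yields the KL divergence and f(t) = (t − 1)² yields the Pearson χ² divergence:

```python
import math

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x q(x) * f(p(x) / q(x)) for discrete distributions
    given as lists of probabilities over a shared support."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q) if qx > 0)

kl_gen = lambda t: t * math.log(t) if t > 0 else 0.0  # generator for KL(P||Q)
chi2_gen = lambda t: (t - 1) ** 2                     # generator for Pearson chi-squared

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
kl = f_divergence(p, q, kl_gen)
chi2 = f_divergence(p, q, chi2_gen)
```

Both divergences are zero exactly when P = Q, which is what makes them usable as distribution-matching objectives.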
implicit maximizer: A technique to estimate the maximum value of a function (like Q-value) over a distribution without explicit optimization, often using expectile or Gumbel regression.
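The expectile-regression variant can be sketched as follows (illustrative code, not the paper's implementation): the τ-expectile of a sample solves the asymmetric first-order condition τ·E[(x − m)₊] = (1 − τ)·E[(m − x)₊], equals the mean at τ = 0.5, and approaches the maximum as τ → 1:

```python
def expectile(xs, tau, iters=100):
    """Find the tau-expectile of samples xs by bisection on the
    first-order condition tau*E[(x - m)+] = (1 - tau)*E[(m - x)+]."""
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        m = (lo + hi) / 2
        # Gradient of the asymmetric squared loss is monotone in m,
        # so bisection on its sign converges to the expectile.
        grad = sum(tau * max(x - m, 0.0) - (1 - tau) * max(m - x, 0.0) for x in xs)
        if grad > 0:
            lo = m
        else:
            hi = m
    return (lo + hi) / 2

samples = [1.0, 2.0, 3.0, 10.0]
mid_expectile = expectile(samples, 0.5)    # the mean, 4.0
high_expectile = expectile(samples, 0.99)  # close to max(samples)
```

Because the expectile is estimated purely from sampled values, it approximates max_a Q(s, a) without ever querying actions outside the dataset, which is the appeal for offline RL.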
Bellman flow constraints: Linear constraints in the LP formulation of RL ensuring that the inflow of probability mass into a state matches the outflow.
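In the standard notation of LP formulations of RL (symbols here are conventional, not quoted from the paper), the flow constraint on a visitation distribution d(s, a) with discount γ, initial distribution μ₀, and transition kernel p reads:

```latex
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} p(s \mid s',a')\, d(s',a') \qquad \forall s \in \mathcal{S}
```

The left side is the total outflow of probability mass from state s; the right side is its inflow from the initial distribution and from discounted transitions.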
coverage assumption: The assumption in offline IL that the suboptimal dataset covers the state-action pairs visited by the expert policy.
SMODICE: State-matching offline distribution correction estimation—a baseline IL method relying on the coverage assumption.