Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

📝 Paper Summary

Offline Reinforcement Learning, Offline Imitation Learning, Convex Optimization in RL

This paper unifies multiple existing offline reinforcement and imitation learning algorithms under a single dual linear programming framework, revealing their limitations and leading to two new methods: a discriminator-free imitation learner (ReCOIL) and a more stable implicit value learner (f-DVL).

Core Problem

Prior off-policy algorithms often suffer from training instability, value overestimation, or restrictive assumptions (like requiring suboptimal data to cover expert visitations) due to unprincipled handling of distribution mismatch.

Why it matters:

Imitation learning methods relying on adversarial discriminators struggle when expert data is sparse or disjoint from offline data, leading to compounding errors.
Implicit policy improvement methods like XQL (Extreme Q-Learning) use unstable loss functions (Gumbel regression) that cause training divergence.
A lack of theoretical unification obscures the root causes of these failures, preventing the design of principled fixes.

Concrete Example: In an offline imitation task where the dataset contains 'medium' quality data plus very few expert trajectories, methods like SMODICE fail (negative returns) because they require learning a density ratio discriminator, which overfits. ReCOIL succeeds by matching a mixture distribution instead.

Key Novelty

Dual RL Framework + ReCOIL & f-DVL

Demonstrates that diverse algorithms (CQL, IQLearn, XQL) are all instances of the dual formulation of regularized policy optimization with specific choices of f-divergence and constraints.
Proposes ReCOIL: An imitation learning method that matches a mixture of expert and suboptimal distributions, eliminating the need for a discriminator or coverage assumptions.
Proposes f-DVL: A family of value-learning algorithms that replaces the unstable exponential loss of XQL with stable surrogates (like Chi-squared or Total Variation) derived from the dual objective.

Architecture

The logic flow for ReCOIL (Algorithm 1) and f-DVL (Algorithm 2). Both follow an iterative update: train Critic (Q/V) using dual objectives, then update Policy (pi) using implicit maximization.

Evaluation Highlights

ReCOIL achieves 108.18 normalized return on Hopper random+expert, ahead of SMODICE (101.61) and far ahead of IQLearn (1.85) in low-coverage settings.
f-DVL (using Total Variation) reaches 98.0 normalized return on Hopper-medium-replay-v2, surpassing XQL (94.0) and CQL (95.0) while exhibiting stable training curves.
ReCOIL recovers reward functions with 0.98 Pearson correlation to ground truth on Hopper, suggesting it captures expert intent better than discriminator-based methods.

Breakthrough Assessment

8/10

Provides a strong theoretical unification that cleans up the landscape of offline RL/IL. The resulting methods (ReCOIL, f-DVL) offer practical, principled improvements over strong baselines like XQL and SMODICE.

⚙️ Technical Details

Problem Definition

Setting: Infinite horizon discounted Markov Decision Process (MDP) in offline settings (RL and Imitation Learning)

Inputs: Offline dataset of transitions (s, a, s'), optionally with rewards r (for RL) or expert demonstrations (for IL)

Outputs: Policy π that maximizes expected cumulative return (RL) or matches expert occupancy (IL)

Pipeline Flow

Dual Formulation (converts constrained LP to unconstrained optimization)
Algorithm Instantiation (select f-divergence and constraints)
Training Loop (Update Q/V and Policy)
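
As a sketch of the first step, the regularized primal over visitation distributions and its unconstrained dual can be written as below. The notation is assumed here rather than taken from the summary: d is the policy's state-action visitation distribution, d^O the offline data distribution, d_0 the initial-state distribution, alpha the regularization weight, and f^* the convex conjugate of f. The non-negativity constraint on d is dropped when forming the dual, for simplicity.

```latex
% Primal: regularized return maximization over visitation distributions d
\max_{d \ge 0} \; \mathbb{E}_{(s,a) \sim d}\big[r(s,a)\big] - \alpha\, D_f\big(d \,\|\, d^O\big)
\quad \text{s.t.} \quad
\sum_{a} d(s,a) = (1-\gamma)\, d_0(s) + \gamma \sum_{s',a'} d(s',a')\, p(s \mid s',a') \quad \forall s

% Dual (V-form): V(s) are the Lagrange multipliers of the flow constraints
\min_{V} \; (1-\gamma)\, \mathbb{E}_{s \sim d_0}\big[V(s)\big]
+ \alpha\, \mathbb{E}_{(s,a) \sim d^O}\!\left[ f^*\!\left( \frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}[V(s')] - V(s)}{\alpha} \right) \right]
```

The dual has no constraints and its expectations are taken over the offline data, which is what makes it usable as a training loss.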

System Modules

Q-function Learner (Critic Learning)

Learns the dual variable Q(s,a) corresponding to the Bellman flow constraints

Model or implementation: MLP (Standard RL architecture)
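
To make the Bellman flow constraints concrete, here is a minimal sketch that computes the discounted visitation distribution of a fixed policy and checks that it satisfies the constraint. The two-state MDP, the policy, and all numbers are hypothetical and chosen only for illustration.

```python
GAMMA = 0.9
# Hypothetical MDP: P[s][a][s2] = probability of moving from s to s2 under a
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
PI = [[0.6, 0.4], [0.25, 0.75]]  # PI[s][a] = policy probability (hypothetical)
D0 = [1.0, 0.0]                  # initial-state distribution (hypothetical)


def visitation(num_iters=2000):
    """Discounted state-action visitation d(s, a) by fixed-point iteration."""
    n_s, n_a = len(P), len(P[0])
    d_s = list(D0)
    for _ in range(num_iters):
        inflow = [
            sum(d_s[s] * PI[s][a] * P[s][a][s2]
                for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)
        ]
        d_s = [(1 - GAMMA) * D0[s2] + GAMMA * inflow[s2] for s2 in range(n_s)]
    return [[d_s[s] * PI[s][a] for a in range(n_a)] for s in range(n_s)]


def flow_residual(d):
    """Per-state gap between outflow and (initial mass + discounted inflow)."""
    n_s, n_a = len(P), len(P[0])
    residuals = []
    for s2 in range(n_s):
        outflow = sum(d[s2])
        inflow = sum(d[s][a] * P[s][a][s2]
                     for s in range(n_s) for a in range(n_a))
        residuals.append(outflow - ((1 - GAMMA) * D0[s2] + GAMMA * inflow))
    return residuals


d = visitation()
print(flow_residual(d))  # residuals should be numerically zero
```

In the dual formulation, each of these per-state constraints receives one multiplier, and the learned critic plays the role of those multipliers.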

Value Function Learner (f-DVL only) (Critic Learning)

Learns state value V(s) via implicit maximization

Model or implementation: MLP
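
The idea of implicit maximization can be illustrated on a single state with a handful of Q-values. The sketch below minimizes an objective of the form (1 - lam) * v + lam * E[g(q - v)] over a scalar v. The penalty g is a truncated quadratic, which has the same shape as the chi-squared conjugate up to a shift and scale; it is an illustrative stand-in, not necessarily the surrogate used in the paper, and the function names are hypothetical.

```python
def objective(v, qs, lam):
    """(1 - lam) * v + lam * mean(max(q - v, 0)^2), convex in v."""
    penalty = sum(max(q - v, 0.0) ** 2 for q in qs) / len(qs)
    return (1.0 - lam) * v + lam * penalty


def implicit_max(qs, lam, iters=200):
    """Minimize the objective over a scalar v by ternary search."""
    lo, hi = min(qs) - 1.0, max(qs)  # bracket is valid for lam in [0.5, 1)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1, qs, lam) < objective(m2, qs, lam):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


q_values = [0.0, 1.0, 2.0, 5.0]  # hypothetical Q(s, a) for dataset actions
for lam in (0.5, 0.9, 0.99):
    print(lam, implicit_max(q_values, lam))
```

As lam approaches 1 the minimizer moves toward the largest Q-value in the sample without ever evaluating a maximum over actions, which is why no out-of-distribution action needs to be queried.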

Policy Extractor

Extracts policy from learned Q/V values

Model or implementation: MLP

Novel Architectural Elements

ReCOIL objective: A specific dual formulation matching a mixture distribution (beta * d_suboptimal + (1-beta) * d_expert) rather than a ratio, removing the discriminator.
f-DVL surrogates: Polynomial loss functions derived from Total Variation or Chi-squared divergences to replace the exponential Gumbel loss in implicit value learning.
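
The stability argument can be seen by comparing how the two loss shapes and their gradients grow with the residual z. This sketch uses the Gumbel regression loss exp(z) - z - 1 and the convex conjugate of the chi-squared generator f(t) = (t - 1)^2, which is z^2 / 4 + z; the function names are illustrative, and the paper's actual surrogates may differ in detail.

```python
import math


def gumbel_loss(z):
    """Exponential (Gumbel regression) loss: exp(z) - z - 1."""
    return math.exp(z) - z - 1.0


def gumbel_grad(z):
    """Derivative of the Gumbel loss with respect to z."""
    return math.exp(z) - 1.0


def chi2_conjugate(z):
    """Convex conjugate of f(t) = (t - 1)^2: z^2 / 4 + z."""
    return 0.25 * z * z + z


def chi2_grad(z):
    """Derivative of the chi-squared conjugate with respect to z."""
    return 0.5 * z + 1.0


for z in (1.0, 5.0, 20.0):
    print(z, gumbel_grad(z), chi2_grad(z))
```

A single large residual produces an exponentially large gradient under the Gumbel loss but only a linearly growing one under the quadratic conjugate, so one outlier in a batch is far less likely to destabilize the update.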

Modeling

Base Model: Standard feed-forward neural networks (MLPs) for Actor and Critic (typically 2-3 layers, 256 units)

Training Method: Offline RL/IL via Dual Optimization

Objective Functions:

Purpose: Train ReCOIL Q-function.
Formally: Minimize beta*(E_suboptimal[Q] - E_expert[Q]) + 0.25 * E_mixture[(gamma*Q' - Q)^2]

Purpose: Train f-DVL Value function.
Formally: min_V (1-lambda)*E[V] + lambda*E[f*_surrogate(Q_target - V)], where f*_surrogate is the surrogate for the convex conjugate of f

Purpose: Extract Policy.
Formally: Maximize E[exp(alpha * (Q - V)) * log pi(a|s)] (Advantage Weighted Regression)
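
The ReCOIL critic loss and the policy-extraction loss above can be sketched on plain lists of per-sample values. This follows the formulas exactly as this summary states them, not the released code. The default gamma of 0.99 and the weight clip at 100 are assumptions, and the function names are hypothetical.

```python
import math


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


def recoil_q_loss(q_sub, q_exp, q_mix, q_next_mix, beta=0.5, gamma=0.99):
    """beta * (E_sub[Q] - E_exp[Q]) + 0.25 * E_mix[(gamma * Q' - Q)^2]."""
    gap = beta * (mean(q_sub) - mean(q_exp))
    residual = mean((gamma * qn - q) ** 2 for q, qn in zip(q_mix, q_next_mix))
    return gap + 0.25 * residual


def awr_policy_loss(q, v, log_probs, alpha=3.0, max_weight=100.0):
    """Negative of E[exp(alpha * (Q - V)) * log pi(a|s)], weights clipped."""
    weights = [min(math.exp(alpha * (qi - vi)), max_weight)
               for qi, vi in zip(q, v)]
    return -mean(w * lp for w, lp in zip(weights, log_probs))


# Hypothetical per-sample values, for illustration only.
print(recoil_q_loss(q_sub=[1.0, 1.0], q_exp=[2.0, 2.0],
                    q_mix=[1.0, 2.0], q_next_mix=[0.0, 0.0]))
print(awr_policy_loss(q=[1.0], v=[1.0], log_probs=[-2.0]))
```

The first term of the critic loss pushes Q up on expert samples and down on suboptimal ones, while the squared term keeps Q consistent across transitions drawn from the mixture. Clipping the exponential weights is a common practical safeguard rather than part of the stated objective.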

Key Hyperparameters:

beta: 0.5 (ReCOIL mixing ratio)
alpha: Temperature parameter for policy extraction (varies by task, e.g., 3.0 or 10.0)
tau: Expectile parameter (e.g., 0.7 or 0.9)
batch_size: 256
learning_rate: 3e-4
lambda: Weight for f-DVL dual objective

Compute: Single GPU training (standard for D4RL benchmarks)

Comparison to Prior Work

vs. SMODICE: ReCOIL avoids learning a discriminator and relaxes the coverage assumption by matching a mixture distribution.
vs. XQL: f-DVL uses stable polynomial loss functions (from Chi2/TV) instead of unstable exponential Gumbel loss.
vs. IQLearn: ReCOIL explicitly incorporates suboptimal data in a principled manner via the mixture dual, whereas IQLearn focuses on expert data.
vs. DEMODICE [not cited in paper]: ReCOIL is simpler and avoids the specific structural constraints of DEMODICE by using the unified dual framework.

Limitations

ReCOIL introduces a hyperparameter beta (mixing ratio) which needs tuning.
f-DVL relies on surrogate conjugate functions because the exact conjugates are not defined on the entire real line.
Evaluation is limited to MuJoCo locomotion and manipulation tasks; no high-dimensional visual domains.
The theoretical unification assumes strictly linear constraints, while deep RL involves non-linear approximation.

Reproducibility

Code: https://hari-sikchi.github.io/dual-rl/

The paper includes extensive derivations in its appendices, uses standard D4RL datasets, and lists hyperparameters for all tasks in the appendix.

📊 Experiments & Results

Evaluation Setup

Standard Offline RL and IL benchmarks using D4RL datasets

Benchmarks:

D4RL Locomotion: continuous control (Hopper, Walker2d, HalfCheetah, Ant)
D4RL Manipulation: robotic manipulation (Pen, Door, Hammer, Relocate, Kitchen)

Metrics:

Normalized Return (0 = random policy, 100 = expert policy; scores can exceed 100)
Reward Recovery Correlation (Pearson correlation)
Statistical methodology: Averaged over 7 seeds with standard deviation reported
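
Both metrics are simple to compute. The sketch below uses the standard D4RL normalization, 100 * (score - random) / (expert - random), and a plain Pearson correlation; the reference returns in the usage lines are hypothetical placeholders, not the benchmark's actual constants.

```python
import math


def normalized_return(raw, random_return, expert_return):
    """Scale a raw return so that random maps to 0 and expert maps to 100."""
    return 100.0 * (raw - random_return) / (expert_return - random_return)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical reference returns, for illustration only.
print(normalized_return(3500.0, random_return=-20.0, expert_return=3200.0))
print(pearson([0.1, 0.4, 0.9], [0.2, 0.5, 1.1]))
```

Because the scale is anchored to a reference expert rather than a hard ceiling, a learned policy that beats that expert scores above 100.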

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Offline Imitation Learning: ReCOIL outperforms baselines in settings with limited expert data and diverse suboptimal data.
random+expert hopper	Normalized Return	101.61	108.18	+6.57
medium+expert walker2d	Normalized Return	2.62	108.54	+105.92
medium+few-expert walker2d	Normalized Return	73.30	91.25	+17.95
Offline Reinforcement Learning: f-DVL shows competitive or superior performance compared to XQL and CQL.
hopper-medium-replay-v2	Normalized Return	94.0	98.0	+4.0
antmaze-umaze-v0	Normalized Return	47.7	87.7	+40.0
antmaze-medium-diverse-v0	Normalized Return	0.0	60.2	+60.2

Experiment Figures

Comparison of policy visitation distribution estimation and reward recovery.

Training curves for XQL vs f-DVL (Chi2 and TV variants) on AntMaze tasks.

Main Takeaways

ReCOIL significantly outperforms discriminator-based methods (SMODICE, ORIL) in 'few-shot' expert data regimes by avoiding density ratio estimation.
f-DVL stabilizes training compared to XQL, preventing divergence in implicit value learning and achieving higher final returns in difficult AntMaze tasks.
The dual framework allows accurate estimation of policy visitation distributions, which is critical for correcting distribution shift in offline settings.
ReCOIL recovers reward functions that are highly correlated (0.92-0.98) with ground truth, validating its interpretation as an energy-based model.

📚 Prerequisite Knowledge

Prerequisites

Lagrangian Duality and Convex Conjugates
Linear Programming formulation of RL
f-divergences (KL, Chi-squared, Total Variation)
Offline Reinforcement Learning (Conservative Q-Learning, Implicit Q-Learning)

Key Terms

Dual RL: A framework solving the unconstrained dual problem of the state-action visitation distribution optimization under linear Bellman flow constraints.

ReCOIL: RElaxed Coverage for Off-policy Imitation Learning—a discriminator-free method matching a mixture of expert and suboptimal distributions.

f-DVL: f-Dual V Learning—a family of offline RL algorithms using stable f-divergence surrogates for implicit value maximization.

XQL: Extreme Q-Learning—a prior method utilizing Gumbel regression for value learning, which the paper identifies as a specific instance of Dual RL with reverse-KL divergence.

f-divergence: A measure of the difference between two probability distributions; examples include KL divergence and Pearson Chi-squared.

implicit maximizer: A technique to estimate the maximum value of a function (like Q-value) over a distribution without explicit optimization, often using expectile or Gumbel regression.

Bellman flow constraints: Linear constraints in the LP formulation of RL ensuring that the inflow of probability mass into a state matches the outflow.

coverage assumption: The assumption in offline IL that the suboptimal dataset covers the state-action pairs visited by the expert policy.

SMODICE: State-matching offline distribution correction estimation—a baseline IL method relying on the coverage assumption.