Model-Free Robust Average-Reward Reinforcement Learning

📝 Paper Summary

Robust Reinforcement Learning, Average-Reward MDPs, Model-Free Reinforcement Learning

This paper introduces the first model-free algorithms for robust average-reward MDPs by using multi-level Monte-Carlo estimators to handle non-linear robust Bellman operators.

Core Problem

Standard RL can fail under model mismatch (the sim-to-real gap). Existing robust RL methods are either model-based (assuming a known uncertainty set) or restricted to the discounted setting, which leaves average-reward tasks such as queuing or inventory control unaddressed.

Why it matters:

Average-reward criteria are often more suitable than discounted criteria for continuing tasks (e.g., inventory management, communication networks) but are mathematically harder to optimize robustly.
Model-based methods are impractical when the uncertainty set is not explicitly known and only samples from a nominal model are available.
Non-linear robust Bellman operators bias standard sample-based estimators, causing naive model-free approaches to fail.
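
The bias mentioned in the last point can be checked by hand on a toy problem. The sketch below is illustrative only and is not taken from the paper: it assumes a two-state nominal distribution, a total-variation ball defined as (1/2)||q - p||_1 <= radius, and hypothetical helper names (tv_worst_case, expected_plugin). It enumerates every possible sample of size n and computes the exact expectation of the plug-in estimate, which can then be compared with the true worst-case value.

```python
from itertools import product


def tv_worst_case(p, v, radius):
    """Minimum of sum_s q(s) * v(s) over q with (1/2) * ||q - p||_1 <= radius.

    Greedy solution: shift up to `radius` probability mass from the
    highest-value states onto the lowest-value state.
    """
    q = list(p)
    worst = min(range(len(v)), key=lambda s: v[s])
    budget = radius
    for s in sorted(range(len(v)), key=lambda s: v[s], reverse=True):
        if s == worst:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[worst] += moved
        budget -= moved
    return sum(qs * vs for qs, vs in zip(q, v))


def expected_plugin(p, v, radius, n):
    """Exact expectation of the plug-in estimate built from n i.i.d. samples."""
    total = 0.0
    for draw in product(range(len(p)), repeat=n):
        prob = 1.0
        for s in draw:
            prob *= p[s]
        # Empirical distribution of this particular sample.
        p_hat = [draw.count(s) / n for s in range(len(p))]
        total += prob * tv_worst_case(p_hat, v, radius)
    return total


# Hypothetical toy numbers, chosen only for illustration.
p_nominal, values, radius = [0.5, 0.5], [0.0, 1.0], 0.2
true_value = tv_worst_case(p_nominal, values, radius)
print("true worst case:", true_value)
for n in (1, 2, 8):
    print("n =", n, "expected plug-in estimate:",
          expected_plugin(p_nominal, values, radius, n))
```

On these toy numbers the true worst-case value is 0.3, while the plug-in estimate averages about 0.4 with one sample and about 0.35 with two: the gap shrinks with more samples but does not vanish at any fixed sample size.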

Concrete Example: In a recycling robot task, a standard agent learns to 'search' for cans assuming a high success probability. If the environment shifts and searching becomes less reliable, the agent fails. The proposed robust agent learns to 'wait', which yields lower nominal reward but prevents catastrophic failure in the worst-case environment.

Key Novelty

Robust RVI TD and Q-learning with Multi-level Monte-Carlo Estimators

Characterizes the solution structure of the robust average-reward Bellman equation, showing any solution is a relative value function for some worst-case kernel.
Adapts Relative Value Iteration (RVI) to the robust setting by introducing an offset function to stabilize the non-contractive average-reward updates.
Employs Multi-level Monte-Carlo (MLMC) to construct unbiased, bounded-variance estimators for non-linear robust operators (e.g., KL divergence, Wasserstein), overcoming the bias inherent in naive plug-in estimators.

Architecture

The pseudocode for Robust RVI TD and Robust RVI Q-learning, defining the iterative update rules.

Evaluation Highlights

Robust RVI Q-learning converges to the optimal robust average-reward on Garnet problems, matching model-based ground truth.
In an Inventory Control task, the robust policy maintains positive profitability under severe demand distribution shifts (perturbation magnitude b=0.25), while standard Q-learning performance degrades significantly.
In a Recycling Robot task, the robust agent switches to a conservative 'wait' policy that is stable across perturbed environments, whereas standard Q-learning retains a risky 'search' policy that fails when parameters change.

Breakthrough Assessment

8/10

Fills a significant theoretical gap by providing the first model-free algorithms for robust average-reward MDPs with rigorous convergence proofs and practical estimators for five different uncertainty sets.

⚙️ Technical Details

Problem Definition

Setting: Infinite-horizon Robust Markov Decision Process (RMDP) with average-reward criterion

Inputs: Samples (s, a, r, s') from a nominal transition kernel P

Outputs: Optimal robust policy π maximizing the worst-case average reward over an uncertainty set

Pipeline Flow

Input: Samples from nominal environment
Estimator: Construct unbiased estimate of robust functional using MLMC
Update: Apply RVI update to Q-values using estimated operator and offset
Control: Select greedy policy w.r.t. updated Q-values
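
The four steps above can be sketched end to end for the simplest uncertainty set. This is a minimal sketch under stated assumptions, not the paper's pseudocode: it assumes a delta-contamination set {(1 - delta) * p + delta * q}, for which the worst case is (1 - delta) * E_p[V] + delta * min(V); that expression is linear in the nominal kernel, so a single sample is already unbiased and no MLMC step is needed. It also assumes synchronous sweeps over all (s, a) pairs, the offset f(Q) = V_Q(s_0), and a hypothetical two-state MDP (P_DEMO, R_DEMO) with deterministic transitions.

```python
import random


def contamination_estimate(v, next_state, delta):
    """One-sample estimate of the worst-case next value for a contamination set."""
    return (1.0 - delta) * v[next_state] + delta * min(v)


def robust_rvi_q_learning(P, R, delta, alpha=0.1, sweeps=500, seed=0):
    """Synchronous robust RVI Q-learning sketch; P[s][a] is the nominal kernel."""
    rng = random.Random(seed)
    n_states, n_actions = len(R), len(R[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        v = [max(row) for row in q]      # V_Q(s) = max_a Q(s, a)
        offset = v[0]                    # assumed offset f(Q) = V_Q(s_0)
        new_q = [row[:] for row in q]
        for s in range(n_states):
            for a in range(n_actions):
                # Input: one sample from the nominal environment.
                s_next = rng.choices(range(n_states), weights=P[s][a])[0]
                # Estimator + update: robust target with the offset subtracted.
                target = R[s][a] - offset + contamination_estimate(v, s_next, delta)
                new_q[s][a] = q[s][a] + alpha * (target - q[s][a])
        q = new_q
    v = [max(row) for row in q]
    # Control: greedy policy with respect to the learned Q-values.
    policy = [max(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
    return q, v[0], policy


# Hypothetical MDP: action 0 stays, action 1 moves; only staying in state 1 pays 1.
P_DEMO = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.0, 1.0], [1.0, 0.0]]]
R_DEMO = [[0.0, 0.0],
          [1.0, 0.0]]

q_table, robust_gain, greedy = robust_rvi_q_learning(P_DEMO, R_DEMO, delta=0.2)
print("estimated robust average reward:", robust_gain)
print("greedy policy:", greedy)
```

For this toy MDP the robust average reward can be derived by hand as 1 - delta (the adversary sends the agent back to the unrewarded state with probability delta), so the printed estimate should approach 0.8, with the greedy policy moving out of state 0 and staying in state 1.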

System Modules

Robust Operator Estimator

Compute an unbiased estimate of the worst-case transition value

Model or implementation: Multi-level Monte-Carlo (MLMC) estimator
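
One common way to build such an estimator is sketched below; it is an illustration under assumptions, not necessarily the exact construction or parameter choice used in the paper. A random level N is drawn with P(N = n) = psi * (1 - psi)^n, 2^(N+1) next-state samples are drawn from the nominal model, and the difference between the plug-in value on all samples and the average of the plug-in values on the odd and even halves is reweighted by 1 / P(N = n). The expectations telescope, so the result is unbiased for the limit of the plug-in estimate. The total-variation functional, the helper names, and psi = 0.6 are hypothetical choices; with this parameterization the expected number of samples is finite when psi > 1/2.

```python
import random


def tv_worst_case(p, v, radius):
    """Minimum of sum_s q(s) * v(s) over q with (1/2) * ||q - p||_1 <= radius."""
    q = list(p)
    worst = min(range(len(v)), key=lambda s: v[s])
    budget = radius
    for s in sorted(range(len(v)), key=lambda s: v[s], reverse=True):
        if s == worst:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[worst] += moved
        budget -= moved
    return sum(qs * vs for qs, vs in zip(q, v))


def empirical(samples, n_states):
    """Empirical distribution of a list of sampled state indices."""
    return [samples.count(s) / len(samples) for s in range(n_states)]


def mlmc_estimate(draw, n_states, v, radius, psi=0.6, rng=random):
    """Multi-level Monte-Carlo estimate of the TV worst-case value.

    `draw` is a zero-argument function returning one next state sampled
    from the nominal model.
    """
    level = 0
    while rng.random() >= psi:           # P(level = n) = psi * (1 - psi)**n
        level += 1
    p_level = psi * (1.0 - psi) ** level
    samples = [draw() for _ in range(2 ** (level + 1))]

    def sigma(block):
        return tv_worst_case(empirical(block, n_states), v, radius)

    # Plug-in on all samples minus the average of the two half-sample plug-ins.
    correction = sigma(samples) - 0.5 * (sigma(samples[0::2]) + sigma(samples[1::2]))
    return sigma(samples[:1]) + correction / p_level


# Hypothetical toy problem: the true worst-case value is 0.3.
demo_rng = random.Random(0)
p_nominal, values = [0.5, 0.5], [0.0, 1.0]


def draw_next_state():
    return demo_rng.choices(range(2), weights=p_nominal)[0]


estimates = [mlmc_estimate(draw_next_state, 2, values, 0.2, rng=demo_rng)
             for _ in range(20000)]
print("MLMC sample mean:", sum(estimates) / len(estimates))
```

Individual estimates are noisy and can fall outside the range of V, but their average should settle near the true value of 0.3, whereas a one-sample plug-in estimate on the same problem averages 0.4.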

RVI Updater

Update value estimates while preventing divergence

Model or implementation: Stochastic Approximation update

Novel Architectural Elements

Integration of Multi-level Monte-Carlo estimators into the Relative Value Iteration loop to handle the non-linearity of the robust Bellman operator in a model-free manner

Modeling

Base Model: Tabular Q-learning / TD Learning

Training Method: Robust Relative Value Iteration (RVI) Q-learning

Objective Functions:

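Purpose: Maximize the worst-case average reward over the uncertainty set.

Formally: $g_\pi(s) = \min_{P \in \mathcal{P}} \lim_{T \to \infty} \frac{1}{T} \mathbb{E}_{\pi, P}\left[ \sum_{t=0}^{T-1} r_t \mid s_0 = s \right]$, and the robust policy maximizes $g_\pi$ over $\pi$.

Purpose: Solve the robust Bellman equation.

Formally: $Q(s,a) = r(s,a) - g + \min_{p \in \mathcal{P}(s,a)} \sum_{s'} p(s') V_Q(s')$, where $V_Q(s) = \max_a Q(s,a)$ and $g$ is the robust average reward.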

Key Hyperparameters:

learning_rate: 0.01
uncertainty_radius_epsilon: 0.4
offset_function: f(V) = mean(V) or f(V) = V(s_0)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Model-based Robust RVI: Operates in a model-free setting using only samples from the nominal model
vs. Robust Discounted Q-learning: Handles the average-reward criterion which lacks the simple contraction property of discounted settings
vs. Differential Q-learning: Optimizes worst-case performance over an uncertainty set rather than nominal performance

Limitations

Restricted to tabular settings (finite state and action spaces)
Assumes (s,a)-rectangular uncertainty sets
Requires the Unichain assumption (average reward independent of initial state) for all models in the uncertainty set
Computational cost of solving dual optimization problems at each step for complex uncertainty sets (e.g., Wasserstein)
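
To give a sense of the per-step cost in the last point, the sketch below evaluates a KL-constrained worst case through its standard one-dimensional dual, sup over alpha > 0 of -alpha * log E_p[exp(-V / alpha)] - alpha * radius. It is an illustrative sketch, not the paper's implementation: the function name kl_worst_case, the ternary search, and the iteration count are assumptions. Each evaluation runs an inner search, and an MLMC-style estimator would call it several times per update.

```python
import math


def kl_worst_case(p, v, radius, iters=200):
    """Worst-case expectation of v over {q : KL(q || p) <= radius}."""
    support = [s for s in range(len(p)) if p[s] > 0.0]
    v_min = min(v[s] for s in support)
    v_max = max(v[s] for s in support)
    if v_max == v_min or radius <= 0.0:
        # Nothing to gain for the adversary: plain expectation under p.
        return sum(p[s] * v[s] for s in support)

    def dual(alpha):
        # Shift by v_min so every exponent is <= 0 (numerically stable).
        shifted = sum(p[s] * math.exp(-(v[s] - v_min) / alpha) for s in support)
        return v_min - alpha * math.log(shifted) - alpha * radius

    # The dual is concave in alpha and its maximizer is at most span(v) / radius.
    lo, hi = 1e-12, (v_max - v_min) / radius
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    # The limit alpha -> 0 gives v_min, so the supremum is never below it.
    return max(v_min, dual(0.5 * (lo + hi)))


# Hypothetical toy distribution and radii, for illustration only.
for r in (0.0, 0.1, 0.3, 1.0):
    print("radius", r, "->", kl_worst_case([0.5, 0.5], [0.0, 1.0], r))
```

The printed values decrease from the nominal expectation toward the smallest value in the support as the radius grows, which is the expected qualitative behavior of a worst-case operator.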

Reproducibility

The paper provides full algorithmic pseudo-code and specific formulas for the unbiased estimators for five uncertainty sets (Contamination, TV, Chi-square, KL, Wasserstein). No code repository is provided.

📊 Experiments & Results

Evaluation Setup

Policy evaluation and control on synthetic and simulated environments

Benchmarks:

Garnet Problem G(30, 20) (Synthetic randomly generated MDP)
Recycling Robot (Classic RL control problem)
Inventory Control (Supply chain management simulation)

Metrics:

Robust Average Reward estimate f(V)
Absolute error relative to model-based ground truth
Average reward under perturbed environments

Statistical methodology: Results are averaged over 30 independent runs; the 5th and 95th percentiles are shown as envelopes.
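
A minimal sketch of how such an envelope can be computed from per-run learning curves follows. The percentile convention (linear interpolation between order statistics) and the synthetic curves are assumptions made for illustration; the paper does not specify which convention it uses.

```python
import math
import random


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def envelope(runs, low=5.0, high=95.0):
    """Per-step (mean, low percentile, high percentile) across runs.

    `runs` is a list of equal-length lists, one list of per-step values per run.
    """
    return [(sum(step) / len(step), percentile(step, low), percentile(step, high))
            for step in zip(*runs)]


# Synthetic stand-in for 30 independent runs of 5 steps each (not real results).
rng = random.Random(0)
synthetic_runs = [[1.0 - 0.5 ** t + rng.gauss(0.0, 0.05) for t in range(5)]
                  for _ in range(30)]
for step, (mean, p5, p95) in enumerate(envelope(synthetic_runs)):
    print(step, round(mean, 3), round(p5, 3), round(p95, 3))
```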

Key Results

Convergence experiments on the Garnet problem verify that the model-free algorithms recover the true values computed by model-based solvers.

Benchmark	Metric	Baseline	This Paper	Δ
Garnet G(30,20)	Robust Average Reward Estimate	4.5	4.5	0.0

Experiment Figures

Performance of Recycling Robot policies under perturbed environments.

Performance on Inventory Control under varying demand distribution perturbations.

Main Takeaways

Robust RVI TD and Q-learning algorithms converge to the true robust values calculated by model-based methods, validating the unbiasedness of the estimators.
In the Recycling Robot task, robust agents learn a conservative 'wait' policy that remains stable when the environment is perturbed, whereas non-robust agents learn a 'search' policy that fails catastrophically.
In Inventory Control, robust agents outperform non-robust baselines significantly as the perturbation of the demand distribution increases (e.g., shifting uniform bounds).

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDPs)
Bellman Equation
Stochastic Approximation
Duality in Optimization

Key Terms

Robust MDP: An MDP formulation where transition probabilities are chosen from an uncertainty set to minimize the agent's reward (worst-case optimization).

Average-Reward: A performance criterion maximizing the long-term average reward per time step, rather than a discounted sum of future rewards.

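Relative Value Function: A function giving, for each state, the expected cumulative difference between the rewards collected when starting from that state and the long-term average reward; it captures the transient advantage of a starting state and is defined only up to an additive constant.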

RVI (Relative Value Iteration): An algorithm for average-reward MDPs that subtracts a reference value (offset) at each step to prevent value estimates from diverging to infinity.
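
The effect of the offset is easy to see on a small example. The sketch below uses the nominal (non-robust) Bellman backup on a hypothetical two-state MDP, with the offset taken as the value of a reference state; the function names are illustrative. Undiscounted value iteration grows by roughly the average reward at every step, while the relative version stays bounded and its offset settles at the average reward.

```python
def bellman(v, P, R):
    """Nominal optimal Bellman backup without discounting."""
    n = len(v)
    return [max(R[s][a] + sum(P[s][a][t] * v[t] for t in range(n))
                for a in range(len(R[s])))
            for s in range(n)]


def value_iteration(P, R, iters):
    """Plain undiscounted value iteration: values drift upward without bound."""
    v = [0.0] * len(R)
    for _ in range(iters):
        v = bellman(v, P, R)
    return v


def relative_value_iteration(P, R, iters, ref=0):
    """RVI: subtract the reference state's current value after each backup."""
    v = [0.0] * len(R)
    for _ in range(iters):
        offset = v[ref]
        v = [x - offset for x in bellman(v, P, R)]
    return v, v[ref]


# Hypothetical MDP: action 0 stays, action 1 moves; only staying in state 1 pays 1.
P_TOY = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
R_TOY = [[0.0, 0.0],
         [1.0, 0.0]]

print("plain VI after 100 steps:", value_iteration(P_TOY, R_TOY, 100))
print("RVI after 100 steps:", relative_value_iteration(P_TOY, R_TOY, 100))
```

Here plain value iteration reaches values near 100 after 100 steps, while RVI stops changing after a few steps and its offset equals the optimal average reward of 1.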

Unichain: A condition where every policy induces a Markov chain with a single recurrent class, ensuring the average reward is independent of the starting state.

Multi-level Monte-Carlo: A sampling technique used here to construct unbiased estimators for non-linear functions (like worst-case operators) that would otherwise be biased if estimated directly from samples.

Bellman Operator: An update rule on value functions that relates the value of a state to the reward and the expected value of the next state; in robust RL, the expectation is replaced by a minimization over the uncertainty set.