
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Zihan Zhang, Qiaomin Xie
Princeton University, University of Wisconsin-Madison
Conference on Learning Theory (COLT), 2023
RL

📝 Paper Summary

Reinforcement Learning Theory · Average-Reward MDPs
A model-free reinforcement learning algorithm achieves near-optimal regret in average-reward environments by estimating value differences between states rather than absolute values to reduce variance.
Core Problem
Model-free algorithms for average-reward MDPs historically suffer from suboptimal regret (scaling with T^(2/3) or worse) compared to model-based methods, or require restrictive assumptions like uniform mixing.
Why it matters:
  • Average-reward formulations are more appropriate than discounted settings for continuing operations like data center resource allocation and network congestion control.
  • Model-based methods achieve optimal regret but have high space complexity (storing transition kernels), while prior model-free methods fail to match this statistical efficiency.
  • Closing the gap between model-based and model-free efficiency is a fundamental open question in RL theory.
Concrete Example: In a weakly communicating MDP, existing model-free methods like Optimistic Q-learning achieve regret scaling with T^(2/3). To match the theoretical lower bound, an algorithm needs regret scaling with √T. The proposed UCB-AVG achieves this, improving the dependence on the horizon T from T^(2/3) to √T.
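To make the gap concrete, here is a quick back-of-the-envelope comparison of the two rates (constants and the logarithmic factors hidden by the Õ(·) notation are ignored):

```python
# Compare the growth of the two regret rates; their ratio grows as T^(1/6).
for T in (10**4, 10**6, 10**8):
    sqrt_rate = T ** 0.5          # optimal rate, matched by UCB-AVG
    prior_rate = T ** (2 / 3)     # prior best model-free rate
    print(f"T={T:.0e}: sqrt(T)={sqrt_rate:.0f}, T^(2/3)={prior_rate:.0f}, "
          f"ratio={prior_rate / sqrt_rate:.1f}")
```

At T = 10^8, for example, the T^(2/3) rate is already more than 20 times larger than the √T rate, and the gap keeps widening as T grows.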
Key Novelty
Variance-Reduced Q-learning with Value-Difference Estimation
  • Approximates the average-reward problem using a discounted MDP with a carefully chosen discount factor close to 1.
  • Uses reference-advantage decomposition to reduce variance, but crucially maintains estimates of *differences* between state values (relative bias) rather than absolute values.
  • Maintains a space-efficient graph of state-pairs to track these value differences, ensuring the reference value function stays within a tight confidence region.
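The first two ingredients above can be sketched on a toy problem. The snippet below is a minimal illustration, not the paper's UCB-AVG algorithm: it runs tabular Q-learning on a hypothetical random 2-state, 2-action MDP with a discount factor chosen close to 1, and estimates each Q-value as a low-variance reference average plus a small advantage term, periodically refreshing the reference. All sizes, the exploration scheme, and the refresh schedule are arbitrary choices for illustration.

```python
import numpy as np

# Toy sketch (hypothetical 2-state, 2-action MDP); NOT the paper's UCB-AVG.
rng = np.random.default_rng(0)
S, A, T = 2, 2, 50_000
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.random((S, A))                       # rewards in [0, 1]

gamma = 1.0 - 1.0 / np.sqrt(T)               # discount factor close to 1, so the
                                             # discounted MDP approximates average reward
Q = np.zeros((S, A))
V_ref = np.zeros(S)                          # frozen reference values
ref_sum = np.zeros((S, A))                   # running sum of V_ref(s') samples
adv_sum = np.zeros((S, A))                   # running sum of V(s') - V_ref(s') samples
n = np.zeros((S, A))

s = 0
for t in range(T):
    # epsilon-greedy action selection (a stand-in for the paper's optimism bonuses)
    a = int(Q[s].argmax()) if rng.random() > 0.1 else int(rng.integers(A))
    s_next = int(rng.choice(S, p=P[s, a]))
    n[s, a] += 1.0
    v_next = Q[s_next].max()
    ref_sum[s, a] += V_ref[s_next]           # reference part, averaged over all visits
    adv_sum[s, a] += v_next - V_ref[s_next]  # advantage part, small once V_ref is close
    Q[s, a] = R[s, a] + gamma * (ref_sum[s, a] + adv_sum[s, a]) / n[s, a]
    if (t + 1) % 10_000 == 0:                # periodically refresh the reference
        V_ref = Q.max(axis=1).copy()
    s = s_next

print("learned Q:\n", Q)
```

The point of the split is that the reference term is averaged over many samples (low variance) while the advantage term has small magnitude once the reference is accurate; the paper's contribution is making this work with *relative* value differences rather than the absolute values used here.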
Evaluation Highlights
  • Achieves regret of Õ(S^5 A^2 sp(h*) √T), the first model-free algorithm to attain optimal √T dependence for weakly communicating MDPs.
  • Improves upon prior best model-free regret of Õ(T^(2/3)) established by Optimistic Q-learning.
  • In the simulator setting, the refined algorithm achieves sample complexity Õ(SA sp(h*)^2 ε^(-2)), matching the optimal ε dependence.
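The ε^(-2) dependence in that last bound can be read off directly: halving the target accuracy quadruples the required samples. A small sketch (constants and log factors hidden by Õ(·) are dropped; the `sample_bound` helper and the plugged-in numbers are illustrative, not from the paper):

```python
# Nominal scaling of the O~(S * A * sp(h*)^2 / eps^2) sample-complexity bound,
# with constants and logarithmic factors dropped.
def sample_bound(S, A, span, eps):
    return S * A * span ** 2 / eps ** 2

base = sample_bound(S=10, A=5, span=3.0, eps=0.1)
halved = sample_bound(S=10, A=5, span=3.0, eps=0.05)
print(f"eps=0.10 -> {base:.0f} samples (up to constants/logs)")
print(f"eps=0.05 -> {halved:.0f} samples, a {halved / base:.0f}x increase")
```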
Breakthrough Assessment
9/10
Solves a significant open problem in RL theory by providing the first model-free algorithm to achieve √T regret for general weakly communicating average-reward MDPs, matching the optimal rate in T.