
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Zihan Zhang, Qiaomin Xie
Princeton University, University of Wisconsin-Madison
Conference on Learning Theory (COLT), 2023
RL

📝 Paper Summary

Reinforcement Learning Theory · Average-Reward MDPs
A model-free reinforcement learning algorithm achieves near-optimal regret in average-reward environments by estimating value differences between states rather than absolute values to reduce variance.
Core Problem
Model-free algorithms for average-reward MDPs historically suffer from suboptimal regret (scaling with T^(2/3) or worse) compared to model-based methods, or require restrictive assumptions like uniform mixing.
Why it matters:
  • Average-reward formulations are more appropriate than discounted settings for continuing operations like data center resource allocation and network congestion control.
  • Model-based methods achieve optimal regret but have high space complexity (storing transition kernels), while prior model-free methods fail to match this statistical efficiency.
  • Closing the gap between model-based and model-free efficiency is a fundamental open question in RL theory.
Concrete Example: In a weakly communicating MDP, existing model-free methods like Optimistic Q-learning achieve regret scaling with T^(2/3). To match the theoretical lower bound, an algorithm needs regret scaling with √T. The proposed UCB-AVG achieves this, improving the dependence on the horizon T from T^(2/3) to √T.
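To make the gap concrete, here is a quick back-of-the-envelope comparison of the two rates (constants and the logarithmic factors hidden by the Õ(·) notation are ignored):

```python
# Compare the growth of the two regret rates; their ratio grows as T^(1/6).
for T in (10**4, 10**6, 10**8):
    sqrt_rate = T ** 0.5          # optimal rate, matched by UCB-AVG
    prior_rate = T ** (2 / 3)     # prior best model-free rate
    print(f"T={T:.0e}: sqrt(T)={sqrt_rate:.0f}, T^(2/3)={prior_rate:.0f}, "
          f"ratio={prior_rate / sqrt_rate:.1f}")
```

At T = 10^8, for example, the T^(2/3) rate is already more than 20 times larger than the √T rate, and the gap keeps widening as T grows.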
Key Novelty
Variance-Reduced Q-learning with Value-Difference Estimation
  • Approximates the average-reward problem using a discounted MDP with a carefully chosen discount factor close to 1.
  • Uses reference-advantage decomposition to reduce variance, but crucially maintains estimates of *differences* between state values (relative bias) rather than absolute values.
  • Maintains a space-efficient graph of state-pairs to track these value differences, ensuring the reference value function stays within a tight confidence region.
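The first two ingredients above can be sketched on a toy problem. The snippet below is a minimal illustration, not the paper's UCB-AVG algorithm: it runs tabular Q-learning on a hypothetical random 2-state, 2-action MDP with a discount factor chosen close to 1, and estimates each Q-value as a low-variance reference average plus a small advantage term, periodically refreshing the reference. All sizes, the exploration scheme, and the refresh schedule are arbitrary choices for illustration.

```python
import numpy as np

# Toy sketch (hypothetical 2-state, 2-action MDP); NOT the paper's UCB-AVG.
rng = np.random.default_rng(0)
S, A, T = 2, 2, 50_000
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.random((S, A))                       # rewards in [0, 1]

gamma = 1.0 - 1.0 / np.sqrt(T)               # discount factor close to 1, so the
                                             # discounted MDP approximates average reward
Q = np.zeros((S, A))
V_ref = np.zeros(S)                          # frozen reference values
ref_sum = np.zeros((S, A))                   # running sum of V_ref(s') samples
adv_sum = np.zeros((S, A))                   # running sum of V(s') - V_ref(s') samples
n = np.zeros((S, A))

s = 0
for t in range(T):
    # epsilon-greedy action selection (a stand-in for the paper's optimism bonuses)
    a = int(Q[s].argmax()) if rng.random() > 0.1 else int(rng.integers(A))
    s_next = int(rng.choice(S, p=P[s, a]))
    n[s, a] += 1.0
    v_next = Q[s_next].max()
    ref_sum[s, a] += V_ref[s_next]           # reference part, averaged over all visits
    adv_sum[s, a] += v_next - V_ref[s_next]  # advantage part, small once V_ref is close
    Q[s, a] = R[s, a] + gamma * (ref_sum[s, a] + adv_sum[s, a]) / n[s, a]
    if (t + 1) % 10_000 == 0:                # periodically refresh the reference
        V_ref = Q.max(axis=1).copy()
    s = s_next

print("learned Q:\n", Q)
```

The point of the split is that the reference term is averaged over many samples (low variance) while the advantage term has small magnitude once the reference is accurate; the paper's contribution is making this work with *relative* value differences rather than the absolute values used here.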
Evaluation Highlights
  • Achieves regret of Õ(S^5 A^2 sp(h*) √T), the first model-free algorithm to attain optimal √T dependence for weakly communicating MDPs.
  • Improves upon prior best model-free regret of Õ(T^(2/3)) established by Optimistic Q-learning.
  • In the simulator setting, the refined algorithm achieves sample complexity Õ(SA sp(h*)^2 ε^(-2)), matching the optimal ε dependence.
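The ε^(-2) dependence in that last bound can be read off directly: halving the target accuracy quadruples the required samples. A small sketch (constants and log factors hidden by Õ(·) are dropped; the `sample_bound` helper and the plugged-in numbers are illustrative, not from the paper):

```python
# Nominal scaling of the O~(S * A * sp(h*)^2 / eps^2) sample-complexity bound,
# with constants and logarithmic factors dropped.
def sample_bound(S, A, span, eps):
    return S * A * span ** 2 / eps ** 2

base = sample_bound(S=10, A=5, span=3.0, eps=0.1)
halved = sample_bound(S=10, A=5, span=3.0, eps=0.05)
print(f"eps=0.10 -> {base:.0f} samples (up to constants/logs)")
print(f"eps=0.05 -> {halved:.0f} samples, a {halved / base:.0f}x increase")
```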
Breakthrough Assessment
9/10
Solves a significant open problem in RL theory by providing the first model-free algorithm to achieve √T regret for general weakly communicating average-reward MDPs, matching the optimal rate in T.