
Maximum Entropy Reinforcement Learning with Diffusion Policy

Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang
International Conference on Machine Learning (2025)
RL

📝 Paper Summary

Maximum Entropy Reinforcement Learning, Diffusion Models for Control
MaxEntDP integrates diffusion models into the Soft Actor-Critic framework using Q-weighted noise estimation and ODE-based entropy calculation to enable effective multimodal exploration in complex environments.
Core Problem
Standard Soft Actor-Critic (SAC) implementations use Gaussian policies, which are unimodal and therefore limited in exploration capacity; as a result, agents can get trapped in local optima in complex multi-goal tasks.
Why it matters:
  • Unimodal policies explore only a single behavioral mode, failing to capture the full distribution of optimal strategies favored by the Maximum Entropy objective
  • Complex real-world environments often have multiple feasible solutions; restriction to a Gaussian policy prevents the agent from discovering or maintaining these diverse options
  • Existing generative policy alternatives (GANs, VAEs) often lack the stability or expressiveness required for robust MaxEnt RL
Concrete Example: In a navigation task with multiple distinct paths to a goal (e.g., left and right corridors), a Gaussian policy collapses to a single path (e.g., left only), whereas the optimal MaxEnt policy should maintain probability mass on both valid trajectories.
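The two-corridor example can be reproduced in one dimension. The sketch below is illustrative only and is not taken from the paper: the double-well `q_value`, the temperature `ALPHA`, and the grid search are all assumptions. It compares the optimal MaxEnt policy, which is proportional to exp(Q/α), with the single Gaussian that minimises the reverse KL divergence to it (the mode-seeking direction used in SAC's policy update).

```python
import numpy as np

# Hypothetical 1-D action space and Q-function (illustrative assumptions).
actions = np.linspace(-2.0, 2.0, 2001)
da = actions[1] - actions[0]
ALPHA = 0.5  # entropy temperature (assumed value)


def q_value(a):
    """Double-well Q: two equally good actions at a = -1 and a = +1."""
    return -4.0 * (a**2 - 1.0) ** 2


# Optimal MaxEnt policy on the grid: pi*(a) proportional to exp(Q(a) / alpha).
log_target = q_value(actions) / ALPHA
log_target -= np.log(np.sum(np.exp(log_target)) * da)  # normalise to a density
maxent_policy = np.exp(log_target)


def reverse_kl(mean, std):
    """KL(Gaussian || target), approximated by a Riemann sum on the grid."""
    log_q = -0.5 * ((actions - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))
    return np.sum(np.exp(log_q) * (log_q - log_target)) * da


# Brute-force search for the best single Gaussian under the reverse KL.
candidates = [(m, s)
              for m in np.linspace(-1.5, 1.5, 61)
              for s in np.linspace(0.05, 1.0, 39)]
best_mean, best_std = min(candidates, key=lambda ms: reverse_kl(*ms))
gaussian_policy = (np.exp(-0.5 * ((actions - best_mean) / best_std) ** 2)
                   / (best_std * np.sqrt(2 * np.pi)))


def mass(policy, side):
    """Probability mass on the left (side=-1) or right (side=+1) half of the grid."""
    return np.sum(policy[np.sign(actions) == side]) * da


print("MaxEnt policy mass (left, right):", mass(maxent_policy, -1), mass(maxent_policy, +1))
print("Best Gaussian mean/std:", best_mean, best_std)
print("Gaussian mass (left, right):", mass(gaussian_policy, -1), mass(gaussian_policy, +1))
```

Under these assumptions the target splits its mass evenly between the two modes, while the fitted Gaussian settles on one of them. Which side it picks is arbitrary here, since the two modes are symmetric.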
Key Novelty
MaxEnt RL with Diffusion Policy (MaxEntDP)
  • Replaces the Gaussian policy in SAC with a diffusion model, enabling the representation of complex multimodal action distributions
  • Introduces 'Q-weighted Noise Estimation' to train the diffusion policy to approximate the exponential of the Q-function, sidestepping the intractable KL-divergence policy update used in standard SAC
  • Numerically integrates the probability flow ordinary differential equation (ODE) to compute the log-probabilities required for the soft Q-function update (exact up to numerical integration error)
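The log-probability step in the last bullet can be sketched generically. The code below is a minimal illustration of likelihood computation via the probability flow ODE. It is not the paper's implementation, and everything in it is an assumption: a two-mode Gaussian mixture stands in for the policy, the noise schedule is sigma(t) = t, the score is known in closed form rather than given by a trained network, and the divergence is taken by finite differences. The ODE is integrated together with the divergence of its drift, and the result is compared with the analytic log-density.

```python
import numpy as np

# Toy 1-D "policy": a two-mode Gaussian mixture over actions (assumed, not from the paper).
MEANS = np.array([-1.0, 1.0])
WEIGHTS = np.array([0.5, 0.5])
S0 = 0.3  # standard deviation of each mode at t = 0


def log_p(x, t):
    """Analytic log-density of the noised mixture p_t, with x_t = x_0 + t * eps."""
    var = S0**2 + t**2
    comps = np.log(WEIGHTS) - 0.5 * np.log(2 * np.pi * var) - 0.5 * (x - MEANS) ** 2 / var
    return np.logaddexp.reduce(comps)


def score(x, t):
    """Analytic score d/dx log p_t(x); a trained network would replace this."""
    var = S0**2 + t**2
    logits = np.log(WEIGHTS) - 0.5 * (x - MEANS) ** 2 / var
    resp = np.exp(logits - np.logaddexp.reduce(logits))  # mode responsibilities
    return np.sum(resp * (-(x - MEANS) / var))


def drift(x, t):
    """Probability flow ODE drift for sigma(t) = t: dx/dt = -t * score(x, t)."""
    return -t * score(x, t)


def rhs(state, t, h=1e-4):
    """Joint derivative of (x, accumulated divergence of the drift)."""
    x = state[0]
    div = (drift(x + h, t) - drift(x - h, t)) / (2 * h)  # 1-D divergence
    return np.array([drift(x, t), div])


def ode_log_prob(x0, t_max=20.0, n_steps=2000):
    """log p_0(x0) = log p_T(x_T) + integral of div(drift), using RK4 steps."""
    dt = t_max / n_steps
    state = np.array([x0, 0.0])
    for i in range(n_steps):
        t = i * dt
        k1 = rhs(state, t)
        k2 = rhs(state + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = rhs(state + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = rhs(state + dt * k3, t + dt)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    x_T, div_integral = state
    # Prior at t_max approximated by N(0, t_max^2), as is common in practice.
    log_prior = -0.5 * np.log(2 * np.pi * t_max**2) - 0.5 * x_T**2 / t_max**2
    return log_prior + div_integral


for x0 in (-1.2, 0.3, 0.9):
    print(x0, ode_log_prob(x0), log_p(x0, 0.0))
```

In higher-dimensional action spaces the finite-difference divergence would typically be replaced by automatic differentiation or a stochastic trace estimator. The remaining gap to the analytic value comes from the integration step size and the Gaussian prior approximation.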
Breakthrough Assessment
7/10
Theoretically sound integration of diffusion models into MaxEnt RL, addressing the specific challenges of entropy calculation and policy targeting. Validated against standard baselines.