
Rewarded Region Replay (R3) for Policy Learning with Discrete Action Space

Bangzheng Li, Ningshan Ma, Zifan Wang
arXiv (2024)
RL Memory Benchmark

📝 Paper Summary

Reinforcement Learning Sparse Reward Environments
R3 improves PPO in sparse-reward settings by storing successful trajectories in a replay buffer and reusing them via a modified importance sampling scheme that discards high-variance samples (those with large policy probability ratios).
Core Problem
Sparse reward environments challenge both PPO (inefficient use of rare successful trajectories) and DDQN (instability), while naively adding replay buffers to on-policy methods causes destructive distribution shift.
Why it matters:
  • Traditional on-policy algorithms like PPO discard successful experiences immediately after one update, slowing learning in environments where success is rare.
  • Naive importance sampling to correct distribution shift often introduces high variance, causing policy updates to diverge.
  • Off-policy methods like DDQN are more sample efficient but often less stable than on-policy methods in discrete action spaces.
Concrete Example: In the Minigrid DoorKey environment, an agent may fail for thousands of episodes before stumbling upon the key and door. PPO uses this single success for one gradient update and discards it. R3 stores this success and replays it, allowing the agent to learn from the rare event multiple times.
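The success-only replay idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the class name, `capacity`, and `success_threshold` are illustrative choices.

```python
import random
from collections import deque

class RewardedReplayBuffer:
    """Sketch of a success-only buffer in the spirit of R3: keep whole
    trajectories whose episodic return exceeds a threshold, so rare
    successes can be replayed across many updates."""

    def __init__(self, capacity=100, success_threshold=0.0):
        self.trajectories = deque(maxlen=capacity)  # oldest successes drop out
        self.success_threshold = success_threshold

    def maybe_store(self, trajectory, episode_return):
        # Only "rewarded" trajectories enter the buffer; failures are ignored.
        if episode_return > self.success_threshold:
            self.trajectories.append(list(trajectory))

    def sample(self, k=1):
        # Replay k stored successes to supplement the fresh on-policy batch.
        k = min(k, len(self.trajectories))
        return random.sample(list(self.trajectories), k)

# Usage: after each episode rollout
buf = RewardedReplayBuffer(capacity=50, success_threshold=0.5)
success = [("s0", 1, 0.0), ("s1", 2, 1.0)]   # (state, action, reward) tuples
failure = [("s0", 0, 0.0), ("s2", 3, 0.0)]
buf.maybe_store(success, episode_return=1.0)  # stored
buf.maybe_store(failure, episode_return=0.0)  # discarded
replayed = buf.sample(k=1)
```

In a DoorKey-style environment, the one trajectory that reaches the goal would be stored here and mixed into many subsequent gradient updates instead of being used once and thrown away.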
Key Novelty
Rewarded Region Replay (R3) with Variance-Clipped Importance Sampling
  • Mimics human reflection by storing only successful trajectories (high reward) in a replay buffer to supplement on-policy training data.
  • Uses a modified importance sampling technique that entirely discards data points where the probability ratio between the new and old policy exceeds a threshold, preventing variance explosion.
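The discard-rather-than-clip mechanism in the second bullet can be sketched as follows. This is a hedged illustration: `ratio_cap` and the function name are assumptions, and the paper's exact surrogate objective may differ.

```python
import numpy as np

def masked_is_policy_loss(logp_new, logp_old, advantages, ratio_cap=2.0):
    """Sketch of a variance-controlled importance-sampling loss: samples
    whose probability ratio pi_new/pi_old exceeds ratio_cap are discarded
    entirely (masked out), rather than clipped as in PPO's objective."""
    ratio = np.exp(logp_new - logp_old)   # pi_new(a|s) / pi_old(a|s)
    keep = ratio <= ratio_cap             # drop high-variance data points
    if not keep.any():
        return 0.0                        # nothing usable in this batch
    # Importance-weighted policy-gradient surrogate over the kept samples.
    return float(-(ratio[keep] * advantages[keep]).mean())

# Usage: second sample has ratio 9.0 > 2.0, so it is discarded outright.
logp_new = np.log(np.array([0.5, 0.9]))
logp_old = np.log(np.array([0.5, 0.1]))
advantages = np.array([2.0, 5.0])
loss = masked_is_policy_loss(logp_new, logp_old, advantages)  # → -2.0
```

Discarding (instead of clipping) means a stale replayed sample contributes nothing to the gradient once the policy has drifted too far, which is how the method avoids the variance explosion of naive importance sampling.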
Evaluation Highlights
  • Significantly outperforms PPO on Minigrid DoorKey and Crossing environments (sparse rewards) [exact numeric deltas not reported in text]
  • Outperforms DDQN (Double Deep Q-Network) on DoorKeyEnv, surpassing a standard off-policy baseline [exact numeric deltas not reported in text]
  • DR3 (Dense R3) variant significantly outperforms PPO on CartPole-v1 (dense rewards) [exact numeric deltas not reported in text]
Breakthrough Assessment
6/10
Proposes a practical, intuitive fix for PPO in sparse reward settings by stabilizing experience replay. While effective on Minigrid, the novelty is an incremental modification of importance sampling.