Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat
Leiden Institute of Advanced Computer Science, Leiden University
arXiv
(2023)
Memory, RL
📝 Paper Summary
Episodic Control, Reinforcement Learning
2M is a reinforcement learning agent that switches between fast episodic memory and slow parametric learning for action selection, sharing data between them to combine speed with asymptotic optimality.
Core Problem
Deep Reinforcement Learning (DRL) is sample-inefficient due to slow reward propagation and representation learning, while Episodic Memory (EM) learns quickly but struggles with generalization and stochasticity.
Why it matters:
Current DRL methods require massive amounts of interaction data, making them impractical for real-world tasks where samples are expensive
Pure episodic approaches (like MFEC) hit performance plateaus early because they lack the generalization capabilities of neural networks
Existing hybrid methods mostly use memory to estimate training targets rather than for direct control, failing to fully exploit the speed of episodic action selection
Concrete Example: In a simple grid world, an episodic agent quickly finds a path to a reward but gets stuck on a sub-optimal route because it lacks the look-ahead mechanism to correct itself. A standard RL agent eventually finds the optimal path but takes many more episodes to propagate the reward signal back to the start state.
Key Novelty
Dual-Memory Switching & Data Sharing
Maintain two distinct decision-making systems: a fast non-parametric Episodic Control (EC) memory and a slow parametric Reinforcement Learning (RL) network
Use a probabilistic schedule to select which memory controls the agent per episode, favoring EC early for speed and RL later for optimality
Share all collected experience between systems: EC data trains the RL network (providing diverse samples), and RL data updates the EC memory (providing exploration)
Architecture
Workflow of the 2M agent illustrating the switching mechanism and data flow.
Evaluation Highlights
Outperforms or matches baselines (DQN, MFEC, EMDQN) across 5 MinAtar games
Data sharing increases Episodic Control return from ~3 to ~6 in the ablation study, suggesting that RL-collected data helps EC escape local optima
Demonstrates faster initial learning than pure RL and better final convergence than pure Episodic Control
Breakthrough Assessment
7/10
Simple but effective framework combining two fundamental approaches. Strong empirical results on small benchmarks (MinAtar), but evaluation on larger suites (ALE) is missing.
⚙️ Technical Details
Problem Definition
Setting: Markov Decision Process (MDP)
Inputs: Current state s
Outputs: Action a to maximize expected cumulative reward
Pipeline Flow
Scheduler (selects memory type for episode)
Action Selector (EC or RL)
Environment (executes action, returns reward)
Shared Memory Update (Buffer + EC Table)
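The four stages above can be sketched as a single training loop. This is a minimal illustration under loud assumptions, not the authors' implementation: `ChainEnv` is a hypothetical toy task, a tabular Q-function stands in for the DQN, epsilon-greedy exploration is assumed for both controllers, and all hyperparameter values are made up for the example.

```python
import random
from collections import deque

class ChainEnv:
    """Hypothetical toy task: start in cell 0, reward 1 on reaching cell 4."""
    n_actions = 2  # 0 = left, 1 = right

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.steps += 1
        goal = self.pos == 4
        return self.pos, (1.0 if goal else 0.0), goal or self.steps >= 20

def greedy(values, rng):
    """Index of the largest value, breaking ties at random."""
    best = max(values)
    return rng.choice([a for a, v in enumerate(values) if v == best])

def run_2m(episodes=200, gamma=0.9, lr=0.5, eps=0.1, decay=0.98, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    acts = range(env.n_actions)
    ec, q = {}, {}                 # EC table; tabular stand-in for the DQN
    buffer = deque(maxlen=10_000)  # replay buffer fed by both controllers
    for ep in range(episodes):
        # 1. Scheduler: one controller per episode, EC favoured early on.
        memory = ec if rng.random() < decay ** ep else q
        state, done, trace = env.reset(), False, []
        while not done:
            # 2. Action selector: epsilon-greedy on the active memory.
            if rng.random() < eps:
                action = rng.randrange(env.n_actions)
            else:
                action = greedy([memory.get((state, a), 0.0) for a in acts], rng)
            # 3. Environment executes the action.
            next_state, reward, done = env.step(action)
            # 4a. Shared update: every transition reaches the replay buffer.
            buffer.append((state, action, reward, next_state, done))
            trace.append((state, action, reward))
            state = next_state
            # One-step Q-learning on a transition sampled from the buffer.
            s, a, r, s2, d = rng.choice(buffer)
            target = r if d else r + gamma * max(q.get((s2, b), 0.0) for b in acts)
            q[(s, a)] = q.get((s, a), 0.0) + lr * (target - q.get((s, a), 0.0))
        # 4b. Shared update: every episode refreshes the EC table.
        g = 0.0
        for s, a, r in reversed(trace):
            g = r + gamma * g
            ec[(s, a)] = max(ec.get((s, a), float("-inf")), g)
    return ec, q
```

Note that both memories are updated regardless of which one acted in the episode; that is the data-sharing idea in its simplest form.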
System Modules
Scheduler
Decides whether to use EC or RL for the current episode based on probability p_ec
Model or implementation: Decaying probability schedule (Exponential decay)
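One way such a schedule could look is sketched below. The functional form and the values of `p_start`, `p_end`, and `decay` are assumptions for illustration; the summary only states that p_ec decays exponentially.

```python
import random

def ec_probability(episode, p_start=1.0, p_end=0.0, decay=0.99):
    """Exponentially decay the chance of using EC from p_start toward p_end."""
    return p_end + (p_start - p_end) * decay ** episode

def choose_controller(episode, rng):
    """Return "EC" with probability p_ec(episode), otherwise "RL"."""
    return "EC" if rng.random() < ec_probability(episode) else "RL"
```

Calling `choose_controller(episode, random.Random(0))` once at the start of each episode fixes the controller for that whole episode.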
2M-EC (Episodic Memory)
Selects actions based on maximum episodic returns stored in a table (or neighbors)
Model or implementation: MFEC (Model-Free Episodic Control)
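A table-based action selector along these lines might look as follows. It is a simplified sketch: it handles exact state matches only and falls back to a random action for unseen states, whereas MFEC estimates values for unseen state-action pairs from nearest neighbours. The function name and the table layout are hypothetical.

```python
import random

def ec_select_action(table, state, actions, rng):
    """Pick the action with the highest stored return for this exact state.

    table maps (state, action) -> best return seen so far. Actions never
    tried in this state have no entry; if none has an entry, act randomly.
    """
    known = [(table[(state, a)], a) for a in actions if (state, a) in table]
    if not known:
        return rng.choice(actions)
    best_return = max(value for value, _ in known)
    best_actions = [a for value, a in known if value == best_return]
    return rng.choice(best_actions)
```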
2M-RL (Parametric Memory)
Selects actions based on learned Q-values from a neural network
Model or implementation: DQN (Deep Q-Network)
Novel Architectural Elements
Dual-controller architecture where the active policy switches entirely between non-parametric (EC) and parametric (RL) systems per episode
Bidirectional data sharing where a parametric learner consumes traces from a non-parametric controller and vice-versa
Modeling
Base Model: Custom CNN for MinAtar (standard DQN architecture)
Training Method: Off-policy Reinforcement Learning (Q-learning) + Episodic Latch update
Objective Functions:
Purpose: Train the parametric RL agent.
Formally: Minimize expected squared error (y(s) - Q(s,a))^2 where y(s) is the one-step TD target.
Purpose: Update the non-parametric EC memory.
Formally: Q_ec(s,a) <- max(Q_ec(s,a), G_t) where G_t is the Monte-Carlo return.
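The two update rules can be written out in a few lines. This sketch assumes discounted Monte-Carlo returns with discount `gamma` and the standard DQN form of the one-step target, r + gamma * max Q(s', a'); the summary does not spell out either detail, and all function names are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Monte-Carlo return G_t for every step, computed backwards in time."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def ec_update(table, episode, gamma=0.99):
    """Q_ec(s,a) <- max(Q_ec(s,a), G_t) for each (state, action, reward)."""
    rewards = [r for _, _, r in episode]
    for (s, a, _), g in zip(episode, discounted_returns(rewards, gamma)):
        table[(s, a)] = max(table.get((s, a), float("-inf")), g)

def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target y: r at episode end, else r + gamma * max_a' Q(s',a')."""
    return reward if done else reward + gamma * max(next_q_values)

def squared_td_error(q_sa, target):
    """Per-sample loss (y - Q(s,a))^2 that the RL network minimizes."""
    return (target - q_sa) ** 2
```

The EC rule never lowers a stored value, which makes it fast but optimistic; the TD rule moves estimates both up and down and is therefore slower but self-correcting.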
vs. MFEC: 2M adds a parametric RL component to improve asymptotic performance and generalization
vs. EMDQN: 2M uses episodic memory for direct action selection (control), whereas EMDQN uses it only to enhance training targets
vs. NEC: 2M decouples the memory and network into two agents that can operate independently, rather than an integrated differentiable architecture
Limitations
Evaluated only on small-scale MinAtar games, not full Atari suite
Relies on handcrafted decay schedules for switching rather than adaptive mechanisms
Performance can degrade if EC data is heavily biased or misleading for the RL agent (observed in Asterix ablation)
Reproducibility
Code is stated to be available after notification but no URL is provided. MinAtar and WindyGrid are standard environments. Hyperparameters are listed in Table I.
📊 Experiments & Results
Evaluation Setup
RL on discrete control tasks (MinAtar games)
Benchmarks:
MinAtar (Discrete Control / Games)
WindyGrid (Tabular Navigation) [New]
Metrics:
Episodic Return (Score)
Sample Efficiency
Statistical methodology: Not explicitly reported in the paper
Key Results
Ablation study on Breakout/MinAtar demonstrating the impact of data sharing on Episodic Control (EC) performance.
Benchmark | Metric | Baseline | This Paper | Δ
MinAtar (Breakout/General) | Episodic Return | 3 | 6 | +3
Experiment Figures
Comparison on WindyGrid showing Q-value learning speed.
Main Takeaways
2M agent consistently matches or outperforms both pure RL (DQN) and pure Episodic Control (MFEC) across MinAtar games.
The 'Decayed' schedule (start with EC, switch to RL) outperforms constant or increasing schedules, validating the hypothesis that EC is better for early learning and RL for late refinement.
Data sharing is bidirectional: RL helps EC escape local optima (e.g., in Breakout), while EC provides high-return trajectories to speed up RL training (e.g., in WindyGrid).
In some games (Asterix), over-reliance on EC data can harm RL performance if the EC solution is locally optimal but risky, highlighting a trade-off in data mixing.