Two-Memory Reinforcement Learning

📝 Paper Summary

Episodic Control Reinforcement Learning

2M is a reinforcement learning agent that switches between fast episodic memory and slow parametric learning for action selection, sharing data between them to combine speed with asymptotic optimality.

Core Problem

Deep Reinforcement Learning (DRL) is sample-inefficient due to slow reward propagation and representation learning, while Episodic Memory (EM) learns quickly but struggles with generalization and stochasticity.

Why it matters:

Current DRL methods require massive amounts of interaction data, making them impractical for real-world tasks where samples are expensive
Pure episodic approaches (like MFEC) hit performance plateaus early because they lack the generalization capabilities of neural networks
Existing hybrid methods mostly use memory to estimate training targets rather than for direct control, failing to fully exploit the speed of episodic action selection

Concrete Example: In a simple grid world, an episodic agent quickly finds a path to a reward but gets stuck on a sub-optimal route because it lacks the look-ahead mechanism to correct itself. A standard RL agent eventually finds the optimal path but takes many more episodes to propagate the reward signal back to the start state.

Key Novelty

Dual-Memory Switching & Data Sharing

Maintain two distinct decision-making systems: a fast non-parametric Episodic Control (EC) memory and a slow parametric Reinforcement Learning (RL) network
Use a probabilistic schedule to select which memory controls the agent per episode, favoring EC early for speed and RL later for optimality
Share all collected experience between systems: EC data trains the RL network (providing diverse samples), and RL data updates the EC memory (providing exploration)

Architecture

Workflow of the 2M agent illustrating the switching mechanism and data flow.

Evaluation Highlights

Outperforms or matches baselines (DQN, MFEC, EMDQN) across 5 MinAtar games
In the Breakout ablation, data sharing raises the Episodic Control return from ~3 to ~6, suggesting that RL-collected data helps EC escape local optima
Demonstrates faster initial learning than pure RL and better final convergence than pure Episodic Control

Breakthrough Assessment

7/10

Simple but effective framework combining two fundamental approaches. Strong empirical results on small benchmarks (MinAtar), but evaluation on larger suites (ALE) is missing.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP)

Inputs: Current state s

Outputs: Action a to maximize expected cumulative reward

Pipeline Flow

Scheduler (selects memory type for episode)
Action Selector (EC or RL)
Environment (executes action, returns reward)
Shared Memory Update (Buffer + EC Table)
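The four stages above can be sketched as a single training loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the chain environment, the constants, and the per-episode decay factor 0.98 are hypothetical, and a tabular Q-learning table stands in for the DQN so the example runs without a deep learning library.

```python
import random
from collections import defaultdict

# Hypothetical toy settings, chosen only for illustration.
N_STATES, N_ACTIONS, GAMMA, ALPHA, EPS = 6, 2, 0.95, 0.5, 0.1

def step(state, action):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def greedy(values, state, rng):
    """Pick a highest-valued action, breaking ties at random."""
    best = max(values[(state, a)] for a in range(N_ACTIONS))
    return rng.choice([a for a in range(N_ACTIONS) if values[(state, a)] == best])

def run(n_episodes=200, seed=0):
    rng = random.Random(seed)
    q_rl = defaultdict(float)  # tabular stand-in for the parametric RL network
    q_ec = defaultdict(float)  # EC table: best return seen per (state, action)
    buffer = []                # replay buffer shared by both controllers
    for ep in range(n_episodes):
        # 1. Scheduler: EC is favoured early, RL later (assumed decay form).
        p_ec = 0.1 + 0.8 * (0.98 ** ep)
        controller = q_ec if rng.random() < p_ec else q_rl
        state, done, trace, t = 0, False, [], 0
        while not done and t < 50:
            # 2. Action selector: epsilon-greedy on whichever memory is in control.
            if rng.random() < EPS:
                action = rng.randrange(N_ACTIONS)
            else:
                action = greedy(controller, state, rng)
            # 3. Environment executes the action.
            nxt, reward, done = step(state, action)
            # 4a. Every transition goes into the shared buffer, whoever acted.
            buffer.append((state, action, reward, nxt, done))
            trace.append((state, action, reward))
            state, t = nxt, t + 1
            # One-step Q-learning update on a transition sampled from the buffer.
            s, a, r, s2, d = rng.choice(buffer)
            target = r if d else r + GAMMA * max(q_rl[(s2, b)] for b in range(N_ACTIONS))
            q_rl[(s, a)] += ALPHA * (target - q_rl[(s, a)])
        # 4b. EC table is updated from the finished episode, whoever acted.
        g = 0.0
        for s, a, r in reversed(trace):
            g = r + GAMMA * g
            q_ec[(s, a)] = max(q_ec[(s, a)], g)
    return q_rl, q_ec

q_rl, q_ec = run()
```

Both memories are written after every episode regardless of which one was in control, which is the data-sharing property the summary describes.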

System Modules

Scheduler

Decides whether to use EC or RL for the current episode based on probability p_ec

Model or implementation: Decaying probability schedule (Exponential decay)
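A minimal sketch of such a schedule, assuming an exponential interpolation from 0.9 down to 0.1. The functional form, the `decay_rate` value, and the function names are assumptions made for illustration; the summary only states that the probability decays exponentially between these endpoints.

```python
import math
import random

def p_ec(episode, p_start=0.9, p_end=0.1, decay_rate=0.01):
    """Probability that EC controls this episode (assumed exponential form)."""
    return p_end + (p_start - p_end) * math.exp(-decay_rate * episode)

def choose_controller(episode, rng):
    """Return 'EC' with probability p_ec(episode), otherwise 'RL'."""
    return "EC" if rng.random() < p_ec(episode) else "RL"

rng = random.Random(0)
# One draw per episode: the chosen memory controls the whole episode.
picks = [choose_controller(ep, rng) for ep in range(1000)]
```

The draw happens once per episode rather than once per step, matching the per-episode switching described for 2M.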

2M-EC (Episodic Memory)

Selects actions based on maximum episodic returns stored in a table (or neighbors)

Model or implementation: MFEC (Model-Free Episodic Control)
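The table-or-neighbors lookup can be sketched as follows. This is a simplified illustration in the spirit of MFEC, not the paper's code: the class name, the Euclidean distance on raw state vectors, and the default value of 0.0 for an empty table are assumptions.

```python
import math

class EpisodicTable:
    """One table per action, mapping a state to the best return observed from it."""

    def __init__(self, n_actions, k=3):
        self.k = k
        self.tables = [dict() for _ in range(n_actions)]

    def update(self, state, action, ret):
        """Keep the maximum return ever observed for (state, action)."""
        key = tuple(state)
        table = self.tables[action]
        table[key] = max(table.get(key, float("-inf")), ret)

    def estimate(self, state, action):
        """Exact match if stored; otherwise the mean over the k nearest stored states."""
        table = self.tables[action]
        key = tuple(state)
        if key in table:
            return table[key]
        if not table:
            return 0.0  # assumption: an action never tried is valued at zero
        nearest = sorted(table.items(), key=lambda kv: math.dist(kv[0], key))[: self.k]
        return sum(v for _, v in nearest) / len(nearest)

    def act(self, state):
        """Greedy action with respect to the episodic estimates."""
        estimates = [self.estimate(state, a) for a in range(len(self.tables))]
        return max(range(len(estimates)), key=estimates.__getitem__)

table = EpisodicTable(n_actions=2, k=2)
table.update([0.0, 0.0], 0, 1.0)
table.update([1.0, 0.0], 0, 3.0)
table.update([0.0, 0.0], 1, 5.0)
table.update([0.0, 0.0], 1, 2.0)  # lower return than the stored 5.0, so it is ignored
```

The `k` parameter corresponds to the `k_neighbors` hyperparameter listed under Key Hyperparameters.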

2M-RL (Parametric Memory)

Selects actions based on learned Q-values from a neural network

Model or implementation: DQN (Deep Q-Network)

Novel Architectural Elements

Dual-controller architecture where the active policy switches entirely between non-parametric (EC) and parametric (RL) systems per episode
Bidirectional data sharing where a parametric learner consumes traces from a non-parametric controller and vice-versa

Modeling

Base Model: Custom CNN for MinAtar (standard DQN architecture)

Training Method: Off-policy Reinforcement Learning (Q-learning) + Episodic Latch update

Objective Functions:

Purpose: Train the parametric RL agent.

Formally: Minimize expected squared error (y(s) - Q(s,a))^2 where y(s) is the one-step TD target.

Purpose: Update the non-parametric EC memory.

Formally: Q_ec(s,a) <- max(Q_ec(s,a), G_t) where G_t is the Monte-Carlo return.
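The two update rules can be written out as a short sketch. The discount factor `gamma`, the function names, and the dictionary-based table are assumptions for illustration; the TD target shown is the standard one-step Q-learning form, y = r + gamma * max over a' of Q(s', a').

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a'), or r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(q_sa, target):
    """Loss term (y - Q(s,a))^2 minimized by the parametric learner."""
    return (target - q_sa) ** 2

def ec_update(q_ec, episode, gamma=0.99):
    """Apply Q_ec(s,a) <- max(Q_ec(s,a), G_t) for every step of a finished episode.

    episode is a list of (state, action, reward); q_ec is a dict keyed by (state, action).
    Returns G_t are accumulated by walking the episode backwards.
    """
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        key = (state, action)
        q_ec[key] = max(q_ec.get(key, float("-inf")), g)
    return q_ec

q_ec = {}
ec_update(q_ec, [("s0", "right", 0.0), ("s1", "right", 1.0)], gamma=1.0)
# A later, worse episode through the same first step does not lower the stored value.
ec_update(q_ec, [("s0", "right", 0.0), ("s1", "left", 0.0)], gamma=1.0)
```

Because the EC rule only ever raises a stored value, it propagates a good outcome to every step of the episode at once, while the TD rule moves value back one step per update. The same max operation is also why EC can overestimate in stochastic environments.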

Key Hyperparameters:

learning_rate: 0.001 or 0.0001
epsilon_exploration: 0.1 or 0.9
k_neighbors: 1, 3, or 10
p_start_to_p_end: 0.9 to 0.1 (decayed schedule)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MFEC: 2M adds a parametric RL component to improve asymptotic performance and generalization
vs. EMDQN: 2M uses episodic memory for direct action selection (control), whereas EMDQN uses it only to enhance training targets
vs. NEC: 2M decouples the memory and network into two agents that can operate independently, rather than an integrated differentiable architecture

Limitations

Evaluated only on small-scale MinAtar games, not full Atari suite
Relies on handcrafted decay schedules for switching rather than adaptive mechanisms
Performance can degrade if EC data is heavily biased or misleading for the RL agent (observed in Asterix ablation)

Reproducibility

Code is stated to be available after notification but no URL is provided. MinAtar and WindyGrid are standard environments. Hyperparameters are listed in Table I.

📊 Experiments & Results

Evaluation Setup

RL on discrete-action tasks (MinAtar games and the tabular WindyGrid environment)

Benchmarks:

MinAtar (Discrete Control / Games)
WindyGrid (Tabular Navigation) [New]

Metrics:

Episodic Return (Score)
Sample Efficiency

Statistical methodology: Not explicitly reported in the paper

Key Results

Ablation study on Breakout/MinAtar demonstrating the impact of data sharing on Episodic Control (EC) performance.

Benchmark	Metric	Baseline (EC without data sharing)	This Paper (EC with data sharing)	Δ
MinAtar (Breakout/General)	Episodic Return	~3	~6	~+3

Experiment Figures

Comparison on WindyGrid showing Q-value learning speed.

Main Takeaways

The 2M agent consistently matches or outperforms both pure RL (DQN) and pure Episodic Control (MFEC) across MinAtar games.
The 'Decayed' schedule (start with EC, switch to RL) outperforms constant or increasing schedules, validating the hypothesis that EC is better for early learning and RL for late refinement.
Data sharing is bidirectional: RL helps EC escape local optima (e.g., in Breakout), while EC provides high-return trajectories to speed up RL training (e.g., in WindyGrid).
In some games (Asterix), over-reliance on EC data can harm RL performance if the EC solution is locally optimal but risky, highlighting a trade-off in data mixing.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning basics (Q-learning, TD error)
Episodic Control (non-parametric memory)
Experience Replay

Key Terms

EC: Episodic Control—a non-parametric method that stores the highest observed return for state-action pairs and uses it directly for decision making

2M: Two-Memory agent—the proposed framework combining EC and RL

MFEC: Model-Free Episodic Control—a specific EC algorithm using nearest neighbors to estimate values for unseen states

DQN: Deep Q-Network—a standard parametric RL algorithm using neural networks to approximate Q-values

EMDQN: Episodic Memory Deep Q-Network—a baseline method that uses episodic memory to improve DQN training targets

MinAtar: A miniature version of Atari games used for efficient RL benchmarking

Minibatches: Small subsets of data sampled from the replay buffer to train the neural network

One-step TD learning: Updating value estimates based only on the immediate reward and the value of the next state (standard Q-learning)