Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning

📝 Paper Summary

Memory organization, Memory recall

The paper introduces two diagnostic benchmarks to demonstrate that standard RL memory architectures (like Transformers) excel at retention but fail at selective rewriting, whereas simple recurrent models with explicit forgetting gates perform better.

Core Problem

Existing memory-augmented RL agents and benchmarks focus primarily on information retention (keeping data), neglecting the critical ability of rewriting (selectively discarding outdated data and integrating new evidence).

Why it matters:

Real-world environments are dynamic; acting on obsolete information (e.g., old navigation cues) leads to failure.
Current benchmarks typically reward holding a single cue until the end of an episode, failing to penalize agents that cannot discard irrelevant history.
Transformer-based agents often lack explicit forgetting mechanisms, causing them to fail in tasks requiring continual updates despite their popularity.

Concrete Example: In the 'Endless T-Maze', an agent receives a cue (e.g., 'turn left') at the start of a corridor. After the turn, a new cue appears (e.g., 'turn right'). A retention-focused agent keeps the old 'left' cue and fails the next junction, whereas a rewriting-capable agent discards 'left' and stores 'right'.
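The contrast in this example can be sketched with two toy agents. This is a minimal illustration, not the paper's agents or environment: `RetentionAgent`, `RewritingAgent`, and `run_corridors` are hypothetical names, and each corridor is reduced to a single cue string.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetentionAgent:
    """Hypothetical agent that stores the first cue and never updates it."""
    memory: Optional[str] = None

    def observe(self, cue: str) -> None:
        if self.memory is None:  # write once, then only retain
            self.memory = cue

    def act(self) -> Optional[str]:
        return self.memory


@dataclass
class RewritingAgent:
    """Hypothetical agent that overwrites its memory with every new cue."""
    memory: Optional[str] = None

    def observe(self, cue: str) -> None:
        self.memory = cue  # discard the old cue, store the new one

    def act(self) -> Optional[str]:
        return self.memory


def run_corridors(agent, cues: List[str]) -> int:
    """Return how many junctions the agent passes before its first wrong turn."""
    passed = 0
    for cue in cues:
        agent.observe(cue)
        if agent.act() != cue:
            break
        passed += 1
    return passed


cues = ["left", "right", "right", "left"]
retention_score = run_corridors(RetentionAgent(), cues)
rewriting_score = run_corridors(RewritingAgent(), cues)
```

With this cue sequence the retention-only agent passes the first junction and fails the second, while the rewriting agent passes all four.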

Key Novelty

Diagnostic Benchmarks for Memory Rewriting

Introduces 'Endless T-Maze' and 'Color-Cubes': environments specifically designed to force agents to overwrite old memory states with new observations to succeed.
Endless T-Maze requires overwriting navigation instructions at every corridor junction, preventing success via static retention.
Color-Cubes requires tracking object locations that stochastically teleport, forcing agents to detect inconsistencies between memory and observation and update their internal map.
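The memory update that Color-Cubes demands (detect a mismatch between memory and observation, then revise the internal map) can be sketched as follows. This is an assumption-laden toy: the observation format (a dict from visible cell to cube color or `None`) and the function name `update_belief` are hypothetical and not taken from the paper.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[int, int]


def update_belief(
    belief: Dict[str, Cell],
    observed: Dict[Cell, Optional[str]],
) -> Dict[str, Cell]:
    """Return a revised color -> cell map given the currently visible cells.

    belief:   remembered location of each cube color.
    observed: visible cells mapped to the cube color seen there, or None if empty.
    """
    new_belief = dict(belief)

    # Step 1: forget remembered locations that the observation contradicts
    # (the cell is visible but does not contain the remembered color).
    for color, cell in belief.items():
        if cell in observed and observed[cell] != color:
            del new_belief[color]

    # Step 2: write the locations of cubes that are visible right now.
    for cell, color in observed.items():
        if color is not None:
            new_belief[color] = cell

    return new_belief


belief = {"red": (0, 0), "blue": (2, 2)}
# The red cube has teleported: its old cell is empty and it is seen at (0, 1).
revised = update_belief(belief, {(0, 0): None, (0, 1): "red"})
```

A retention-only memory would skip Step 1 and keep heading for the stale location.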

Architecture

A conceptual comparison between Memory Rewriting and Retention-Only strategies.

Evaluation Highlights

Recurrent models (PPO-LSTM) outperform Transformer-based agents (GTrXL) on the Endless T-Maze, with LSTM maintaining high rewards while GTrXL performance collapses as corridor length increases.
In Color-Cubes, LSTM agents achieve significantly higher success rates than GTrXL, which struggles to update memory after object teleportation.
Structured memory (SHM) fails to generalize in rewriting tasks, performing worse than standard LSTM baselines.

Breakthrough Assessment

7/10

Identifies a fundamental blind spot in current RL memory research (rewriting vs. retention) and provides targeted benchmarks, though it primarily evaluates existing architectures rather than proposing a novel model.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) where the optimal policy requires continually updating a memory state m_t based on new observations.

Inputs: Observation o_t (partial view of environment), previous memory state m_{t-1}

Outputs: Action a_t, updated memory state m_t

Pipeline Flow

Observation Encoding (CNN/MLP)
Memory Processing (LSTM / Transformer / Structured Memory)
Policy/Value Heads (Actor-Critic)

System Modules

Encoder

Process visual or vector observations into a latent representation

Model or implementation: 3-layer CNN (for visual) or MLP (for vector)

Memory Core

Maintain and update hidden state based on history

Model or implementation: Variable (LSTM, GTrXL, or SHM)

Heads

Output action probabilities and value estimates

Model or implementation: MLP (Actor and Critic heads)
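The data flow through the three modules (encoder, memory core, heads) can be sketched as one step function that maps (o_t, m_{t-1}) to (a_t, m_t). Everything here is a hypothetical stand-in: the toy encoder, the moving-average memory, and the greedy action choice are placeholders for illustration, not the CNN/MLP, LSTM/GTrXL/SHM, or sampling policy used in the paper.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def make_agent_step(
    encode: Callable[[Vector], Vector],
    memory_step: Callable[[Vector, Vector], Vector],
    heads: Callable[[Vector], Tuple[Vector, float]],
) -> Callable[[Vector, Vector], Tuple[int, float, Vector]]:
    """Compose encoder -> memory core -> heads into a single step function."""

    def step(obs: Vector, memory_prev: Vector) -> Tuple[int, float, Vector]:
        latent = encode(obs)                        # observation encoding
        memory = memory_step(latent, memory_prev)   # m_t from o_t and m_{t-1}
        logits, value = heads(memory)               # actor logits, critic value
        # Greedy action for illustration only; PPO samples from the policy.
        action = max(range(len(logits)), key=lambda k: logits[k])
        return action, value, memory

    return step


# Toy stand-ins (assumptions, chosen so the arithmetic is easy to follow).
def toy_encode(obs: Vector) -> Vector:
    return [float(x) for x in obs]


def toy_memory(latent: Vector, memory_prev: Vector) -> Vector:
    # Moving average: half old memory, half new evidence.
    return [0.5 * m + 0.5 * z for z, m in zip(latent, memory_prev)]


def toy_heads(memory: Vector) -> Tuple[Vector, float]:
    return list(memory), sum(memory)


step = make_agent_step(toy_encode, toy_memory, toy_heads)
action, value, memory = step([0.0, 1.0], [0.0, 0.0])
```

Swapping `toy_memory` for a different update rule changes how quickly old evidence is displaced, which is the property the benchmarks probe.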

Novel Architectural Elements

No novel architecture proposed; the paper contributes novel diagnostic environments (Endless T-Maze, Color-Cubes) to test existing architectures.

Modeling

Base Model: Comparison of PPO-LSTM, GTrXL, and SHM

Training Method: Proximal Policy Optimization (PPO)

Objective Functions:

Purpose: Maximize expected return while ensuring stable policy updates.
Formally: Standard PPO clipped surrogate objective.

Purpose: Minimize error in value estimation.
Formally: Mean squared error between the value estimate and the returns.

Purpose: Encourage exploration.
Formally: Entropy bonus.
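The three terms above combine into the usual PPO loss. The sketch below is a plain-Python version of that standard formulation, under stated assumptions: the coefficient values (`clip_eps=0.2`, `value_coef=0.5`, `entropy_coef=0.01`) are common defaults chosen for illustration, since the paper's hyperparameters are not reported, and the function name `ppo_loss` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def ppo_loss(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    values: Sequence[float],
    returns: Sequence[float],
    entropies: Sequence[float],
    clip_eps: float = 0.2,       # assumed default, not from the paper
    value_coef: float = 0.5,     # assumed default, not from the paper
    entropy_coef: float = 0.01,  # assumed default, not from the paper
) -> Tuple[float, float, float, float]:
    """Return (total_loss, policy_loss, value_loss, mean_entropy) for a batch."""
    n = len(advantages)

    # Clipped surrogate: take the pessimistic (smaller) of the two estimates.
    surrogate = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate += min(ratio * adv, clipped * adv)
    policy_loss = -surrogate / n

    # Mean squared error between value estimates and returns.
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n

    # Entropy bonus: higher entropy lowers the total loss.
    mean_entropy = sum(entropies) / n

    total = policy_loss + value_coef * value_loss - entropy_coef * mean_entropy
    return total, policy_loss, value_loss, mean_entropy
```

When the probability ratio exceeds `1 + clip_eps` for a positive advantage, the clipped term caps the policy gain, which is what keeps updates stable.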

Adaptation: End-to-end training of memory and policy

Training Data:

Procedurally generated episodes in Endless T-Maze and Color-Cubes

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
gamma: Not explicitly reported in the paper
clip_epsilon: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. PPO-LSTM: Tests whether simple gating (a forget gate) is sufficient for rewriting, compared with more complex attention.
vs. GTrXL: Tests whether attention mechanisms can handle destructive updates (overwriting) or whether they are biased toward retention.
vs. SHM: Tests whether structured write operations provide better stability for rewriting than gating.
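To make "simple gating" concrete, the standard LSTM cell update is c_t = f_t * c_{t-1} + i_t * tanh(g_t), where f_t is the forget gate and i_t the input gate. The scalar sketch below uses that textbook equation; the gate pre-activations are hand-picked for illustration, and whether a trained agent drives its gates this way is an empirical question, not something this sketch shows.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cell_update(c_prev: float, forget_logit: float,
                input_logit: float, candidate: float) -> float:
    """Scalar LSTM cell-state update: c_t = f * c_prev + i * tanh(candidate)."""
    f = sigmoid(forget_logit)  # near 1 keeps the old state, near 0 erases it
    i = sigmoid(input_logit)   # near 1 writes the new candidate
    return f * c_prev + i * math.tanh(candidate)


old_cue = 0.9  # stored encoding of an earlier cue (illustrative value)

# Retention: forget gate open, input gate closed -> old state survives.
retained = cell_update(old_cue, forget_logit=10.0, input_logit=-10.0, candidate=-3.0)

# Rewriting: forget gate closed, input gate open -> old state is replaced.
rewritten = cell_update(old_cue, forget_logit=-10.0, input_logit=10.0, candidate=-3.0)
```

Here `retained` stays close to 0.9 while `rewritten` moves close to tanh(-3), so a single gated step can perform a destructive update.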

Limitations

The paper focuses on diagnostic tasks rather than complex photorealistic environments.
Evaluation is limited to PPO; other RL algorithms (e.g., off-policy methods) are not tested.
Hyperparameters for the training algorithm are not detailed in the provided text.

Reproducibility

Code: https://quartz-admirer.github.io/Memory-Rewriting/

The paper defines the environment logic clearly, but hyperparameters for PPO are not explicitly listed in the main text provided.

📊 Experiments & Results

Evaluation Setup

Reinforcement Learning in procedurally generated diagnostic environments.

Benchmarks:

Endless T-Maze (Continuous navigation with sequential cue overwriting) [New]
Color-Cubes (Grid-world object tracking with stochastic teleportation) [New]

Metrics:

Success Rate
Average Reward
Episode Length (survival time)

Statistical methodology: Not explicitly reported in the paper
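The three metrics can be aggregated from per-episode records as sketched below. The `Episode` record and its field names are hypothetical, and the exact definitions used in the paper (for example, what counts as a success) are not given in this summary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Episode:
    """Hypothetical per-episode log record."""
    total_reward: float
    length: int      # number of environment steps survived
    success: bool    # task-specific success flag


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate success rate, average reward, and mean episode length."""
    return {
        "success_rate": mean(1.0 if e.success else 0.0 for e in episodes),
        "average_reward": mean(e.total_reward for e in episodes),
        "mean_episode_length": mean(e.length for e in episodes),
    }


summary = summarize([
    Episode(total_reward=1.0, length=10, success=True),
    Episode(total_reward=0.0, length=20, success=False),
])
```

Because the statistical methodology is not reported, any confidence intervals or seed counts would have to be added as a separate assumption.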

Key Results

Results on Endless T-Maze showing the impact of memory rewriting requirements.

Benchmark	Metric	Baseline	This Paper	Δ
Endless T-Maze	Performance Stability	Failure	Success	Qualitative

Results on Color-Cubes across difficulty levels.

Benchmark	Metric	Baseline	This Paper	Δ
Color-Cubes (Medium/Extreme)	Success Rate	Low	High	Qualitative

Experiment Figures

Diagram of the Endless T-Maze environment.

Diagram of the Color-Cubes environment.

Main Takeaways

Recurrent models (LSTM) with explicit forget gates are surprisingly more effective at memory rewriting than Transformers (GTrXL) or Structured Memory (SHM).
Transformers struggle to selectively discard outdated information, often failing in tasks that require overwriting previous cues.
Structured memories (SHM) are brittle and succeed only under narrow conditions, failing to generalize to dynamic rewriting tasks.
Adaptive forgetting is a critical design requirement for RL agents in non-stationary POMDPs, distinct from long-context retention.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, POMDPs)
Recurrent Neural Networks (LSTM)
Transformers in RL (Transformer-XL)
Memory-augmented neural networks

Key Terms

POMDP: Partially Observable Markov Decision Process, a setting where the agent cannot see the full state of the world and must make decisions based on incomplete observations and memory.

Memory rewriting: The process of selectively discarding outdated information from memory and replacing it with new, relevant information.

PPO: Proximal Policy Optimization, a policy-gradient reinforcement learning algorithm with a clipped surrogate objective, used here to train the agents.

LSTM: Long Short-Term Memory, a type of recurrent neural network capable of learning long-term dependencies, equipped with gating mechanisms (including a forget gate) that control what is kept and what is discarded.

GTrXL: Gated Transformer-XL, a transformer architecture adapted for RL that uses gating layers to stabilize training and a cache of past hidden states for long contexts.

SHM: Stable Hadamard Memory, a structured memory architecture that stores information in a memory matrix and updates it with element-wise (Hadamard) products.

Endless T-Maze: A navigation task consisting of an infinite sequence of corridors where directional cues change at every junction, requiring continual memory updates.

Color-Cubes: A grid-world task where agents must find specific colored cubes that may teleport, requiring the agent to update its internal map of object locations.