ARROW: Augmented Replay for RObust World models

📝 Paper Summary

Continual Reinforcement Learning (CRL), Model-Based Reinforcement Learning (MBRL), Experience Replay

ARROW extends the DreamerV3 world model with a dual-buffer replay system that balances recent experience with long-term distribution matching to mitigate catastrophic forgetting in continual learning.

Core Problem

Continual reinforcement learning agents suffer from catastrophic forgetting when learning sequential tasks, and existing model-based solutions often require prohibitively large replay buffers to maintain performance.

Why it matters:

Real-world agents must adapt to non-stationary environments where data is streamed and tasks do not reliably repeat
Standard FIFO buffers in World Models overwrite old experiences, leading to rapid degradation of previously learned skills
Existing solutions scale poorly because retaining complete experience histories demands large memory, limiting deployment on resource-constrained hardware

Concrete Example: In a sequence of Atari games, a standard agent trained on 'Enduro' and then 'Seaquest' will overwrite the 'Enduro' memories in its FIFO buffer. As a result, its ability to play 'Enduro' collapses while it learns 'Seaquest', failing to retain the earlier skill.

Key Novelty

Augmented Replay for RObust World models (ARROW)

Maintains two complementary replay buffers: a short-term FIFO buffer for plasticity (learning current tasks) and a long-term global distribution-matching buffer for stability (remembering past tasks)
Uses reservoir sampling to curate the long-term buffer, ensuring it retains a representative distribution of all past experiences without growing indefinitely
Splices experience episodes into fixed-length chunks to increase the diversity of trajectories stored within a limited memory budget

Architecture

The ARROW architecture integrating the World Model, Actor-Critic, and Augmented Replay Buffer.

Evaluation Highlights

Achieves roughly 4x or more less average forgetting on Atari tasks (0.14 versus baseline scores of 0.55 and 0.85) than the DreamerV3 and TES-SAC baselines with matched memory budgets
Maintains comparable forward transfer capabilities to baselines while significantly improving stability
Demonstrates robust performance on both diverse tasks (Atari) and tasks with shared structure (Procgen CoinRun variants) with a total replay budget of only 2^19 observations

Breakthrough Assessment

7/10

Successfully applies bio-inspired replay to modern World Models (DreamerV3) with strong empirical results on forgetting. While the components (reservoir sampling, dual buffers) are known, their integration into MBRL for efficient continual learning is novel and effective.

⚙️ Technical Details

Problem Definition

Setting: Continual Reinforcement Learning in Partially Observable Markov Decision Processes (POMDPs) without task identifiers

Inputs: Stream of observations (images) and rewards from sequential tasks

Outputs: Actions maximizing discounted returns across all tasks encountered so far

Pipeline Flow

Environment Interaction → Replay Buffer Storage
Replay Buffer Sampling (Short-term + Long-term) → World Model Training
World Model Imagination → Actor-Critic Training
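
The second pipeline step draws world-model training data from both buffers. The sketch below is one minimal way to do that, not the paper's implementation. The function name sample_mixed_batch, the 50/50 split (long_fraction), and sampling with replacement are all assumptions for illustration; the summary does not state the mixing ratio.

```python
import random
from collections import deque


def sample_mixed_batch(short_term, long_term, batch_size, rng, long_fraction=0.5):
    """Draw a training batch from both buffers.

    long_fraction is an assumed mixing ratio, not a value from the paper.
    short_term must be non-empty; long_term may still be empty early on.
    """
    n_long = int(batch_size * long_fraction) if len(long_term) > 0 else 0
    n_short = batch_size - n_long
    # Sample with replacement from each buffer, then shuffle the combined batch.
    batch = [rng.choice(long_term) for _ in range(n_long)]
    batch += [rng.choice(short_term) for _ in range(n_short)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
short_term = deque([("current", i) for i in range(8)], maxlen=8)  # FIFO buffer
long_term = [("past", i) for i in range(8)]                      # long-term buffer
batch = sample_mixed_batch(short_term, long_term, batch_size=16, rng=rng)
early_batch = sample_mixed_batch(short_term, [], batch_size=16, rng=rng)
```

With an empty long-term buffer the sketch falls back to sampling only recent data, which is what a single-task start would look like.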

System Modules

Short-term Buffer (D1) (Memory)

Stores the most recent experiences to ensure the model adapts quickly to the current task (plasticity)

Model or implementation: FIFO Queue

Long-term Buffer (D2) (Memory)

Preserves a diverse distribution of experiences across all tasks to prevent forgetting (stability)

Model or implementation: Reservoir Sampling Buffer
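
For reference, a fixed-capacity reservoir can be sketched with the classic "Algorithm R" form of reservoir sampling. This is a generic illustration rather than the authors' code; the class name ReservoirBuffer and its fields are hypothetical.

```python
import random


class ReservoirBuffer:
    """Fixed-size uniform sample over a stream of unknown length (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
            return
        # Buffer full: keep the new item with probability capacity / seen,
        # overwriting a uniformly chosen slot.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


buf = ReservoirBuffer(capacity=4)
for i in range(1000):
    buf.add(i)
```

After any number of insertions, every item seen so far has the same probability of being in the buffer, so old tasks are not systematically pushed out by new ones.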

World Model

Learns dynamics from buffered data and generates imagined trajectories

Model or implementation: DreamerV3 (RSSM)

Actor-Critic

Learns policy and value functions from imagined data

Model or implementation: MLP Networks

Novel Architectural Elements

Dual-buffer architecture integrating a FIFO buffer and a Global Distribution Matching (Reservoir) buffer specifically for training a World Model
Splicing mechanism that chops continuous episodes into fixed chunks (T=512) to decouple memory management from episode length
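
The splicing step can be pictured as simple fixed-length slicing. In the sketch below, splice_episode is a hypothetical helper, and keeping the shorter final chunk (rather than dropping or padding it) is an assumption; the summary does not say how the remainder is handled.

```python
def splice_episode(episode, chunk_len=512):
    """Cut one episode (a list of transitions) into chunks of at most chunk_len steps."""
    return [episode[i:i + chunk_len] for i in range(0, len(episode), chunk_len)]


episode = list(range(1300))  # stand-in for 1300 transitions
chunks = splice_episode(episode)
chunk_lengths = [len(c) for c in chunks]
```

Because each chunk is stored and evicted independently, one very long episode cannot occupy a disproportionate share of the buffer.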

Modeling

Base Model: DreamerV3 (RSSM with GRU + MLPs)

Training Method: Model-Based RL with Augmented Replay

Objective Functions:

Purpose: Train the World Model to reconstruct observations and rewards.

Formally: Variational lower bound (ELBO) including reconstruction loss and KL divergence between posterior and prior dynamics.

Purpose: Train the Actor to maximize expected returns.

Formally: REINFORCE gradients on imagined trajectories.

Purpose: Train the Critic to estimate value.

Formally: Regression to temporal difference targets on imagined trajectories.
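
One common form of temporal difference target in Dreamer-style agents is the λ-return, computed backwards over an imagined trajectory. The sketch below assumes that form; the summary does not specify the exact target ARROW uses, and the gamma and lam defaults are placeholder values, not numbers taken from the paper.

```python
def lambda_returns(rewards, values, gamma=0.997, lam=0.95):
    """Compute lambda-return targets for one imagined trajectory.

    rewards: r_0 .. r_{H-1}
    values:  v_0 .. v_H (one extra entry used to bootstrap the final step)
    gamma, lam: placeholder defaults, not reported in the summary.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs exactly one more entry than rewards")
    ret = values[-1]
    targets = []
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer-horizon return.
        ret = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * ret)
        targets.append(ret)
    return targets[::-1]


one_step = lambda_returns([1.0, 1.0], [0.0, 2.0, 4.0], gamma=0.5, lam=0.0)
full = lambda_returns([1.0, 1.0], [0.0, 2.0, 4.0], gamma=0.5, lam=1.0)
```

With lam=0 the target reduces to the one-step TD target r_t + gamma * v_{t+1}; with lam=1 it becomes the discounted return bootstrapped only at the end of the trajectory.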

Training Data:

Atari: Ms. Pac-Man, Boxing, Crazy Climber, Frostbite, Seaquest, Enduro
Procgen CoinRun: 6 variants with progressive perturbations (Big-background, Noise-background, etc.)

Key Hyperparameters:

buffer_capacity_D1: 2^18 observations
buffer_capacity_D2: 2^18 observations
spliced_sequence_length: 512
latent_state_size: 32 discrete units x 32 classes
discount_gamma: Not explicitly reported in the paper (likely standard DreamerV3 default)
batch_size: Not explicitly reported in the paper
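
The memory budget implied by these values can be checked with a few lines of arithmetic. The constant names are illustrative, and the chunk count assumes every stored chunk is a full 512 steps.

```python
D1_CAPACITY = 2 ** 18   # short-term FIFO buffer, in observations
D2_CAPACITY = 2 ** 18   # long-term reservoir buffer, in observations
CHUNK_LEN = 512         # spliced sequence length

total_observations = D1_CAPACITY + D2_CAPACITY
chunks_per_buffer = D1_CAPACITY // CHUNK_LEN  # assumes full-length chunks

print(total_observations)  # 524288, i.e. 2^19
print(chunks_per_buffer)   # 512
```

The two buffers together hold 2^19 observations, which is the total the summary quotes when comparing against baselines with a matched memory footprint.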

Compute: Single-GPU training (implied by 'DreamerV3 single-GPU benchmarks' context)

Comparison to Prior Work

vs. DreamerV3: Adds dual-buffer system (FIFO + Reservoir) vs. single FIFO buffer; matches total memory footprint
vs. TES-SAC: Model-based approach training on imagined data vs. model-free training on replay data; matches total memory footprint
vs. ER-based methods (e.g., in classification): Applies replay to World Model training rather than direct policy training [not cited in paper]

Limitations

Relies on fixed-entropy regularization for exploration, which may be insufficient for hard-exploration tasks without task IDs
Evaluation limited to visual tasks (Atari, CoinRun); applicability to other modalities not tested
Requires storage of high-dimensional observations (images), though mitigated by spliced rollouts

Reproducibility

Code: https://anonymous.4open.science/r/ARROW-B6F2/

The code is publicly available at the link above. Standard Atari and Procgen environments are used. Exact hyperparameters for batch size and learning rates are not explicitly listed in the main text but likely follow DreamerV3 defaults.

📊 Experiments & Results

Evaluation Setup

Continual RL on sequences of tasks without task identifiers

Benchmarks:

Atari 2600: arcade games with diverse dynamics and visuals
Procgen CoinRun: platformer variants with shared structure and perturbations [New]

Metrics:

Average Forgetting (Lower is better)
Forward Transfer (Higher is better)
Average Accuracy (ACC)
Worst-case Accuracy (WC-ACC)
Maximum Forgetting (Max-F)
Recovery
Statistical methodology: Not explicitly reported in the paper
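
The summary lists these metrics without defining them. As a rough guide only, the sketch below uses a common continual-learning convention for average forgetting: the drop between the score on a task right after training on it and the score at the end of the sequence. The function name, the performance-matrix layout, and the definition itself are assumptions; the paper's exact formula may differ.

```python
def average_forgetting(perf):
    """Mean drop in score on earlier tasks by the end of training.

    perf[i][j] is the normalized score on task j after training on task i
    (an assumed layout). Requires at least two tasks.
    """
    n = len(perf)
    drops = [perf[j][j] - perf[n - 1][j] for j in range(n - 1)]
    return sum(drops) / len(drops)


# Three tasks: scores on tasks 0 and 1 decay after later training.
forgetting = average_forgetting([
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.25, 0.5, 1.0],
])

# Two tasks: the score on task 0 improves after training on task 1.
backward = average_forgetting([
    [0.5, 0.0],
    [0.75, 1.0],
])
```

Under this convention a negative value means performance on earlier tasks improved, which is how a negative forgetting score can be read as backward transfer.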

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Atari experiments demonstrate ARROW's superior stability compared to baselines with equal memory budgets.
Atari (Default Order)	Average Forgetting	0.55	0.14	-0.41
Atari (Default Order)	Average Forgetting	0.85	0.14	-0.71
Atari (Default Order)	Forward Transfer	-0.06	-0.05	+0.01
CoinRun experiments show ARROW maintains performance even when tasks share structure, though margins are tighter.
CoinRun (Default Order)	Average Forgetting	0.03	-0.01	-0.04
Two-cycle experiments highlight recovery and worst-case performance.
Atari (Two-cycle)	Max-F (Maximum Forgetting)	0.62	0.29	-0.33
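
The Δ column is the ARROW score minus the baseline score. The short check below recomputes it from the values in the table and also derives the forgetting ratios for the two Atari rows; the variable names are illustrative.

```python
rows = [
    # (benchmark, metric, baseline, this_paper)
    ("Atari (Default Order)", "Average Forgetting", 0.55, 0.14),
    ("Atari (Default Order)", "Average Forgetting", 0.85, 0.14),
    ("Atari (Default Order)", "Forward Transfer", -0.06, -0.05),
    ("CoinRun (Default Order)", "Average Forgetting", 0.03, -0.01),
    ("Atari (Two-cycle)", "Max-F", 0.62, 0.29),
]

# Delta = this paper minus baseline, rounded to the table's precision.
deltas = [round(ours - base, 2) for _, _, base, ours in rows]

# How many times smaller the Atari forgetting score is than each baseline's.
atari_ratios = [base / ours for _, metric, base, ours in rows[:2]]
```

The two Atari ratios come out at about 3.9 and 6.1, so the "4x less forgetting" headline corresponds to the smaller of the two baseline gaps.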

Experiment Figures

Spider charts comparing ARROW, DreamerV3, and TES-SAC across multiple metrics (ACC, Forgetting, Forward Transfer, etc.) for Atari and CoinRun.

Main Takeaways

Strategic replay (FIFO + Reservoir) applied to World Models significantly reduces catastrophic forgetting compared to FIFO-only baselines
ARROW achieves these gains with a modest memory footprint (2^19 observations), matching the size of baseline buffers
Model-based approach (ARROW) consistently outperforms the model-free baseline (TES-SAC) in stability and sample efficiency for continual learning
Benefits are most pronounced in diverse tasks (Atari) where task interference is high; in structured tasks (CoinRun), backward transfer is observed

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (POMDPs, returns, policy)
Model-Based RL (World Models, RSSM)
Continual Learning concepts (Catastrophic Forgetting, Stability-Plasticity)

Key Terms

World Model: A learned internal simulator of the environment that predicts future states and rewards, used to train the agent via imagination

Catastrophic Forgetting: The tendency of neural networks to abruptly lose knowledge of previously learned tasks when trained on new information

Reservoir Sampling: A randomized algorithm to maintain a representative sample of a stream of unknown size using fixed memory

RSSM: Recurrent State-Space Model—a specific neural architecture used in Dreamer agents to model environment dynamics using deterministic and stochastic components

FIFO: First-In-First-Out—a buffer strategy that discards the oldest data to make room for new data

DreamerV3: A state-of-the-art model-based RL algorithm that masters diverse domains using a World Model and fixed hyperparameters

Spliced Rollouts: Long episodes cut into smaller fixed-length chunks to allow finer-grained management of storage and sampling

SAC: Soft Actor-Critic—a popular off-policy model-free RL algorithm that maximizes a trade-off between expected return and entropy

Forward Transfer: The ability of an agent to use knowledge from previous tasks to learn new tasks faster or better

Backward Transfer: The improvement in performance on previous tasks resulting from training on new tasks