Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem

📝 Paper Summary

Transfer Learning in Reinforcement Learning
Catastrophic Forgetting

Fine-tuning pre-trained RL models often fails because agents forget skills in parts of the state space they do not visit early in fine-tuning, a phenomenon the authors call forgetting of pre-trained capabilities (FPC). Knowledge retention techniques such as behavioral cloning mitigate it.

Core Problem

In RL fine-tuning, the interplay between actions and observations causes agents to visit only a subset of states ('Close') early on, leading to the catastrophic forgetting of pre-trained capabilities in unvisited parts of the environment ('Far').

Why it matters:

Standard fine-tuning often leads to performance deterioration rather than improvement, negating the benefits of pre-training
Fine-tuning approaches from supervised learning assume i.i.d. data, an assumption that fails in RL, where the data distribution is non-stationary and depends on the agent's current policy
Valuable capabilities (e.g., solving deeper game levels) are lost before the agent can re-explore those areas

Concrete Example: In RoboticSequence, a robot pre-trained to unplug a peg ('Far' task) but fine-tuned on a sequence starting with hammering ('Close' task) forgets how to unplug the peg by the time it relearns to hammer, resulting in a 0% success rate for the full sequence.
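
The mechanism behind this example can be illustrated with a deliberately tiny supervised analog. The sketch below is not the paper's experiment and not an RL agent: it assumes a two-parameter linear model, one hypothetical 'Close' state and one hypothetical 'Far' state that share a feature, and made-up targets. Fine-tuning only on the 'Close' state moves the shared parameter and degrades the 'Far' prediction, while adding a retention penalty on the 'Far' state keeps it intact.

```python
import numpy as np

# Hypothetical toy setup: one 'Close' and one 'Far' state that share a feature.
X_CLOSE = np.array([1.0, 1.0])
X_FAR = np.array([0.0, 1.0])
W_PRETRAINED = np.array([0.0, 1.0])  # predicts 1.0 on both states
FAR_TARGET = 1.0                     # pre-trained behavior we want to retain
CLOSE_TARGET = 2.0                   # new downstream target on the 'Close' state


def fine_tune(retention_coef, steps=500, lr=0.1):
    """Gradient descent on the 'Close' loss plus an optional 'Far' retention penalty."""
    w = W_PRETRAINED.copy()
    for _ in range(steps):
        close_err = w @ X_CLOSE - CLOSE_TARGET
        far_err = w @ X_FAR - FAR_TARGET
        grad = 2.0 * close_err * X_CLOSE + 2.0 * retention_coef * far_err * X_FAR
        w -= lr * grad
    return w


def far_error(w):
    """Absolute deviation from the pre-trained prediction on the 'Far' state."""
    return abs(w @ X_FAR - FAR_TARGET)


w_vanilla = fine_tune(retention_coef=0.0)   # never sees the 'Far' state
w_retained = fine_tune(retention_coef=1.0)  # also penalizes drift on the 'Far' state
print("Far error, vanilla fine-tuning:", round(far_error(w_vanilla), 3))
print("Far error, with retention:", round(far_error(w_retained), 3))
```

In this toy the vanilla run settles at a 'Far' error of about 0.5, while the run with the retention term fits the new 'Close' target and keeps the 'Far' error near zero. The real setting differs in that the agent's own policy decides which states it visits, which is what makes the gap persist.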

Key Novelty

Forgetting of Pre-trained Capabilities (FPC) Mitigation

Conceptualizes RL fine-tuning failure as a memory retention problem specifically caused by 'State Coverage Gap' (rarely visiting states where pre-training helps) and 'Imperfect Cloning Gap' (drift from expert behavior)
Demonstrates that standard continual learning methods (EWC, Behavioral Cloning, Kickstarting) are sufficient to fix this, treating fine-tuning as a forgetting mitigation task rather than just an exploration task

Architecture

Conceptual illustration of 'Close' and 'Far' state sets and how forgetting occurs during fine-tuning.

Evaluation Highlights

Achieves more than 10,000 points on NetHack (Human Monk), nearly double the previous state-of-the-art neural model (about 5,000 points)
Solves all four stages of the RoboticSequence task 80% of the time using Behavioral Cloning, while vanilla fine-tuning collapses to near 0%
Maintains ~1.0 success rate in Montezuma's Revenge Room 7 (the 'Far' state) using knowledge retention, whereas vanilla fine-tuning drops significantly during early training

Breakthrough Assessment

8/10

Identifies a fundamental, overlooked cause of transfer failure in RL (FPC) and achieves a large (nearly 2x) improvement over the previous state of the art on the difficult NetHack benchmark using simple, existing tools.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning a pre-trained policy π* on a downstream Markov Decision Process (MDP)

Inputs: Current state s

Outputs: Action a

Pipeline Flow

Pre-trained Policy Initialization (θ*)
Fine-tuning Loop: Interaction with Environment
Optimization: RL Loss + Knowledge Retention Auxiliary Loss
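
The optimization step above combines two terms. One common way to write such a combined objective is a weighted sum, shown here as an assumption rather than the paper's exact formulation: the weight λ and the symbol L_RL do not appear in this summary and are introduced only for illustration.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{RL}}(\theta) \;+\; \lambda \, \mathcal{L}_{\mathrm{aux}}(\theta),
\qquad \theta_{0} = \theta^{*}
```

Here θ is initialized at the pre-trained parameters θ*, and L_aux stands for whichever retention loss is used (EWC, BC, or KS).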

System Modules

Policy Network

Selects actions based on observations; initialized from pre-trained model

Model or implementation: Neural Network (Specifics vary by task, e.g., APPO for NetHack)

Knowledge Retention Mechanism

Calculates auxiliary loss to prevent deviation from pre-trained behavior

Model or implementation: Mathematical Objective (BC/KS/EWC)

Novel Architectural Elements

Integration of forgetting mitigation objectives (typically used in Continual Learning) directly into the standard Transfer RL fine-tuning loop

Modeling

Base Model: Varies by task (e.g., Tuyls et al. 2023 model for NetHack)

Training Method: Reinforcement Learning (PPO/APPO/SAC) with Auxiliary Retention Loss

Objective Functions:

Purpose: Regularize weight changes based on parameter importance (EWC).

Formally: L_aux(θ) = Σ_i F_i (θ*_i - θ_i)^2, where F_i is the i-th diagonal entry of the Fisher information matrix

Purpose: Replay-based retention using expert data (Behavioral Cloning).

Formally: L_BC(θ) = E_{s~B_BC} [KL(π*(s) || π_θ(s))]

Purpose: Distillation on online data (Kickstarting).

Formally: L_KS(θ) = E_{s~π_θ} [KL(π*(s) || π_θ(s))]
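
The three objectives above can be sketched in a few lines of NumPy for a policy with discrete actions. This is a minimal illustration under stated assumptions, not the authors' code: policies are represented by raw logits, the function names are hypothetical, and the diagonal Fisher values are taken as given. Note that L_BC and L_KS share the same formula and differ only in where the states come from (the expert buffer B_BC versus the student's own rollouts).

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over the action dimension."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for each row of two categorical distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def ewc_loss(theta, theta_star, fisher_diag):
    """L_aux: Fisher-weighted squared distance to the pre-trained parameters."""
    return float(np.sum(fisher_diag * (theta_star - theta) ** 2))


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(pi*(s) || pi_theta(s)) over a batch of states.

    Fed with states from the expert buffer this plays the role of L_BC;
    fed with states from the student's own rollouts it plays the role of L_KS.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return float(np.mean(kl_divergence(teacher, student)))


# Hypothetical batch: 4 states, 3 actions.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 3))
student_logits = teacher_logits + 0.5 * rng.normal(size=(4, 3))

print("KL-based retention loss:", distillation_loss(teacher_logits, student_logits))
print("EWC loss:", ewc_loss(np.array([1.0, 2.0]), np.zeros(2), np.array([2.0, 0.5])))
```

The KL direction follows the formulas above, with the pre-trained policy π* as the first argument.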

Key Hyperparameters:

NetHack_pretraining_transitions: 115 Billion
Montezuma_split_point: Room 7

Compute: Not reported in the paper

Comparison to Prior Work

vs. Tuyls et al.: Fine-tunes the clone with RL + Retention instead of just cloning
vs. Vanilla Fine-tuning: Explicitly adds auxiliary loss to prevent forgetting of 'Far' state capabilities
vs. Progressive Neural Networks: Does not expand architecture size; retains knowledge in same weights

Limitations

Choosing the right retention method (KS vs BC) depends on the specific type of gap (Imperfect Cloning vs State Coverage) and is not fully automated
Episodic Memory (EM) requires an off-policy algorithm (such as SAC), so it cannot be used with on-policy methods (such as PPO)
EWC consistently underperforms compared to replay/distillation methods in the tested environments

Reproducibility

Code: https://github.com/BartekCupial/finetuning-RL-as-CL

Code is publicly available. Pre-trained model for NetHack is sourced from Tuyls et al. (2023). Pre-training data for Montezuma/RoboticSequence is generated by the authors.

📊 Experiments & Results

Evaluation Setup

Fine-tuning pre-trained agents on downstream tasks with distinct state distributions

Benchmarks:

NetHack Learning Environment: procedurally generated dungeon crawler, Human Monk character
Montezuma's Revenge: Atari game with hard exploration
RoboticSequence (Meta-World): sequential robotic manipulation [New]

Metrics:

Average Score (NetHack)
Success Rate (RoboticSequence, Montezuma Room 7)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
NetHack (Human Monk)	Score	5555	10101	+4546
RoboticSequence	Success Rate	0	0.80	+0.80
Montezuma's Revenge (Room 7)	Success Rate	0.1	1.0	+0.9
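
The Δ column and the headline improvement factor can be recomputed directly from the table. The short sketch below uses only the values listed above; the helper name `summarize` is an illustrative choice.

```python
# (baseline, this paper) pairs copied from the Key Results table.
results = {
    "NetHack (Human Monk)": (5555, 10101),
    "RoboticSequence": (0.0, 0.80),
    "Montezuma's Revenge (Room 7)": (0.1, 1.0),
}


def summarize(baseline, ours):
    """Return the absolute gain and, when the baseline is non-zero, the ratio."""
    delta = ours - baseline
    ratio = ours / baseline if baseline else None
    return delta, ratio


for name, (baseline, ours) in results.items():
    delta, ratio = summarize(baseline, ours)
    ratio_text = f"{ratio:.2f}x" if ratio is not None else "n/a"
    print(f"{name}: delta={delta:.2f}, ratio={ratio_text}")
```

On these values the NetHack ratio comes out at about 1.82x, slightly under a full doubling, and the ratio is undefined for RoboticSequence because the baseline is 0.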

Experiment Figures

Performance curves for NetHack, Montezuma's Revenge, and RoboticSequence comparing Vanilla FT, Scratch, and FT + Retention (BC/EWC/KS).

Log-likelihoods of expert trajectories on the 'Far' task (push-wall) during fine-tuning.

Main Takeaways

Vanilla fine-tuning in RL is not safe; it leads to rapid forgetting of capabilities in state spaces not immediately visited (State Coverage Gap).
Knowledge retention methods (BC, KS) are essential for transfer RL, enabling positive transfer where vanilla methods fail.
The choice of retention method matters: BC is better for State Coverage Gap (unseen states), while KS is efficient for Imperfect Cloning Gap (distribution shift on seen states).
In NetHack, retention does more than prevent degradation: it lets the agent improve upon the pre-trained expert, nearly doubling the score.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, SAC)
Catastrophic Forgetting / Continual Learning
KL Divergence

Key Terms

FPC: Forgetting of Pre-trained Capabilities—the phenomenon where an RL agent loses skills learned during pre-training because it doesn't visit the relevant states early in fine-tuning

State Coverage Gap: A specific instance of FPC where the agent operates in 'Close' states (start of task) and forgets how to act in 'Far' states (later in task) before reaching them

Imperfect Cloning Gap: An instance of FPC where slight differences between the pre-trained model and the expert behavior it was cloned from lead to distribution shift and subsequent forgetting

EWC: Elastic Weight Consolidation—a regularization method that penalizes changes to important network parameters (identified by the Fisher information matrix) to prevent forgetting

BC: Behavioral Cloning—in this context, an auxiliary loss that forces the policy to stay close to the pre-trained policy's output on a set of replay buffer states

KS: Kickstarting—a distillation method where the student policy is regularized to stay close to the teacher (pre-trained) policy on states visited by the student

APPO: Asynchronous Proximal Policy Optimization—an efficient, distributed version of the PPO reinforcement learning algorithm

SAC: Soft Actor-Critic—an off-policy RL algorithm that maximizes a trade-off between expected return and entropy

RND: Random Network Distillation—an exploration bonus method that encourages agents to visit unfamiliar states by predicting the output of a fixed random network
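
The EWC entry above refers to parameter importance derived from the Fisher information matrix. As a minimal sketch of what that quantity looks like, the code below computes the diagonal Fisher for a small linear-softmax policy π(a|s) = softmax(W s). The policy form, the function name `diagonal_fisher`, and the example states are assumptions made for illustration; the paper's networks are far larger and would typically estimate this from sampled actions instead of an exact sum.

```python
import numpy as np


def softmax(logits):
    """Softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def diagonal_fisher(weights, states):
    """Diagonal Fisher information of a linear-softmax policy.

    weights: array of shape (num_actions, num_features)
    states:  array of shape (num_states, num_features)
    Returns an array shaped like `weights`, averaged over the given states.
    """
    num_actions = weights.shape[0]
    fisher = np.zeros_like(weights)
    for s in states:
        probs = softmax(weights @ s)
        for a in range(num_actions):
            one_hot = np.eye(num_actions)[a]
            # Gradient of log pi(a|s) with respect to W is (one_hot - probs) s^T.
            grad = np.outer(one_hot - probs, s)
            # Exact expectation over actions: weight by pi(a|s).
            fisher += probs[a] * grad ** 2
    return fisher / len(states)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))              # 3 actions, 2 features
S = np.array([[1.0, 0.0], [2.0, 0.0]])   # the second feature is never active
F = diagonal_fisher(W, S)
print(F)
```

In this example the second column of `F` is zero because that feature is inactive in every listed state, so an EWC penalty built from `F` would leave those weights unconstrained.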