MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning

📝 Paper Summary

Continual Learning, Catastrophic Forgetting Mitigation

MSSR mitigates catastrophic forgetting in continual LLM fine-tuning by modeling sample-level retention dynamics and adaptively scheduling replay based on the Ebbinghaus forgetting curve.

Core Problem

Continual fine-tuning of LLMs causes catastrophic forgetting, and existing replay strategies are either heuristic (fixed intervals), reactive (wait for loss spikes), or computationally expensive (frequent evaluation).

Why it matters:

LLMs deployed in dynamic environments must acquire new knowledge without degrading previously learned skills (e.g., in healthcare or law).
Current methods inadequately model the temporal heterogeneity of forgetting, often assuming uniform replay needs across time.
Scalability is limited because monitoring overhead for accuracy-based replay becomes prohibitive in long training runs.

Concrete Example: In a sequence like Alpaca → GSM8K → Math, a model trained on Math might forget basic instruction following (Alpaca). Fixed replay wastes compute on stable samples, while accuracy-based replay only triggers after performance has already dropped significantly.

Key Novelty

Memory-Inspired Sampler and Scheduler Replay (MSSR)

Models each data sample's 'memory strength' as a value that decays over time and increases with each replay, inspired by the Ebbinghaus forgetting curve.
Replay intervals expand over time (spacing effect): replay is frequent initially when forgetting is rapid, and becomes sparser as memory stabilizes.
Prioritizes replay for samples with lower memory strength (higher forgetting risk) rather than random selection.
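
The expanding replay intervals described above can be illustrated with a minimal sketch. It assumes the classic exponential retention form R(t) = exp(-t / S) and a hypothetical multiplicative stability boost after each replay; the function names, `boost`, and `threshold` are illustrative choices, and the paper's own update rule may differ.

```python
import math
from typing import List

def next_replay_gap(stability: float, threshold: float) -> float:
    """Time until retention exp(-t / stability) falls to `threshold`."""
    return -stability * math.log(threshold)

def replay_schedule(n_replays: int, stability: float = 1.0,
                    boost: float = 2.0, threshold: float = 0.5) -> List[float]:
    """Return the gaps between successive replays.

    `boost` (assumed > 1) is a hypothetical factor by which each replay
    multiplies stability; it is not a value taken from the paper.
    """
    gaps = []
    for _ in range(n_replays):
        gaps.append(next_replay_gap(stability, threshold))
        stability *= boost  # each replay slows subsequent decay
    return gaps

gaps = replay_schedule(4)
```

Because stability grows after every replay, each gap is longer than the previous one, which is the spacing effect in miniature.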

Architecture

The MSSR framework workflow, illustrating the closed-loop interaction between memory tracking, replay scheduling, and fine-tuning.

Evaluation Highlights

Achieves the most consistently strong performance across 3 backbone models (Qwen2.5, Gemma2, Llama-3.1) on sequential reasoning tasks.
Outperforms fixed and accuracy-based replay baselines on the 11-task long-sequence benchmark while reducing computational overhead.
Effectively mitigates early-task forgetting in long sequences compared to reactive baselines.

Breakthrough Assessment

7/10

Offers a principled, theoretically grounded alternative to heuristic replay. While the core concept (Ebbinghaus) is established in psychology, applying it to sample-level scheduling for LLM continual learning is a solid methodological contribution.

⚙️ Technical Details

Problem Definition

Setting: Continual fine-tuning on a sequence of datasets {D1, D2, ..., DT} where the model must learn Dt while minimizing forgetting on previous Di.

Inputs: Sequence of task datasets Dt

Outputs: A single model Ft capable of performing all tasks seen so far.

Pipeline Flow

Sample Memory Tracking (updates memory strength based on loss)
Replay Scheduler (determines when and how much to replay)
LoRA-based Optimization (updates model on mixed data)

System Modules

Sample Memory Tracker (Memory Management)

Updates per-sample memory strength m_{i,t} and stability S_{i,t} based on observed loss and elapsed time.

Model or implementation: Analytical decay model (Eq. 4 in paper)
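
As a rough illustration of what such a tracker might look like, the sketch below keeps a strength and stability value per sample, decays strength with elapsed steps, and reinforces it on replay. This is not the paper's Eq. 4: the exponential decay, the exp(-loss) recall proxy, and the roles given to `alpha` and `gamma` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class SampleMemory:
    strength: float = 1.0   # m_{i,t}, kept in [0, 1]
    stability: float = 1.0  # S_{i,t}; larger means slower decay
    last_step: int = 0      # step at which the state was last updated

def decay(mem: SampleMemory, step: int) -> None:
    """Apply exponential decay for the steps elapsed since the last update."""
    elapsed = step - mem.last_step
    mem.strength *= math.exp(-elapsed / mem.stability)
    mem.last_step = step

def reinforce(mem: SampleMemory, step: int, loss: float,
              alpha: float = 0.5, gamma: float = 1.5) -> None:
    """Update a sample's state when it is replayed at `step`.

    Assumed roles: `alpha` scales how far strength moves toward 1,
    `gamma` multiplies stability. Both are hypothetical here.
    """
    decay(mem, step)
    recall = math.exp(-loss)  # maps a loss in [0, inf) to (0, 1]
    mem.strength += alpha * recall * (1.0 - mem.strength)
    mem.stability *= gamma
```

A low observed loss yields a recall proxy near 1 and a larger strength increase, while a high loss leaves the sample close to its decayed strength.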

Replay Scheduler (Memory Management)

Calculates replay probability for each sample and determines the global replay ratio.

Model or implementation: Probabilistic sampler
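
One plausible way to realize such a sampler is to weight each sample by its forgetting risk (one minus memory strength). The sketch below is hypothetical: the normalization, the `eps` smoothing term, and the `max_ratio` cap on the global replay ratio are assumptions, not details confirmed by the paper.

```python
import random
from typing import List

def replay_probabilities(strengths: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize forgetting risk (1 - strength) into sampling probabilities."""
    risks = [1.0 - m + eps for m in strengths]
    total = sum(risks)
    return [r / total for r in risks]

def sample_replay_batch(strengths: List[float], k: int,
                        rng: random.Random) -> List[int]:
    """Draw k sample indices (with replacement), favoring weak memories."""
    probs = replay_probabilities(strengths)
    return rng.choices(range(len(strengths)), weights=probs, k=k)

def global_replay_ratio(strengths: List[float], max_ratio: float = 0.3) -> float:
    """Hypothetical rule: scale the replay share by the mean forgetting risk."""
    mean_risk = 1.0 - sum(strengths) / len(strengths)
    return max_ratio * min(1.0, max(0.0, mean_risk))

rng = random.Random(0)
batch = sample_replay_batch([0.9, 0.1, 0.5], k=8, rng=rng)
```

Sampling with replacement keeps the example short; a real trainer would likely deduplicate indices within a batch.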

Optimization Engine

Updates model parameters using joint loss from current task and replay data.

Model or implementation: LoRA (Low-Rank Adaptation)

Novel Architectural Elements

Integration of an analytical memory decay model (based on Ebbinghaus curve) directly into the data sampling loop of an LLM trainer.

Modeling

Base Model: Qwen2.5-7B (primary), Gemma2-9B, Llama-3.1-8B, Mistral-7B-v0.3

Training Method: Continual Fine-Tuning with Experience Replay

Objective Functions:

Purpose: Jointly optimize performance on current task and retention of past tasks.

Formally: L(theta) = L_task(D_new; theta) + lambda * L_replay(B_replay; theta)
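
The objective above can be written out as a small helper. This sketch assumes each term is a mean over per-sample losses and that an empty replay batch contributes nothing; the function name and the default value of `lam` are illustrative.

```python
from typing import Sequence

def joint_loss(task_losses: Sequence[float],
               replay_losses: Sequence[float],
               lam: float = 1.0) -> float:
    """Compute L_task + lam * L_replay from per-sample losses.

    Mean reduction is an assumption; the paper's reduction may differ.
    """
    task = sum(task_losses) / len(task_losses)
    if not replay_losses:  # no replay scheduled for this step
        return task
    replay = sum(replay_losses) / len(replay_losses)
    return task + lam * replay

total = joint_loss([2.0, 4.0], [1.0, 3.0], lam=0.5)
```

Here the task term averages to 3.0 and the replay term to 2.0, so with lam = 0.5 the combined loss is 4.0.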

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA adapters only
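
To give a sense of why training only LoRA adapters is cheap, the sketch below counts trainable parameters for a single weight matrix. The layer dimensions and rank are hypothetical, since the summary does not report the LoRA rank used.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)

def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in

# Hypothetical 4096 x 4096 projection with rank 8 (not values from the paper).
lora = lora_trainable_params(4096, 4096, rank=8)
full = full_params(4096, 4096)
fraction = lora / full
```

For these illustrative numbers the adapter holds 65,536 parameters against 16,777,216 in the frozen matrix, about 0.39% of the full count.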

Training Data:

Sequence 1: Alpaca-GPT4 -> GSM8K-RFT -> Competition Math
Sequence 2: 11 distinct tasks including AGNews, SQuAD, SciQ, BoolQ, ARC, Math subsets

Key Hyperparameters:

optimizer: AdamW
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
LoRA_rank: Not explicitly reported in the paper

Compute: NVIDIA A100 GPUs (80 GB)

Comparison to Prior Work

vs. Fixed Replay: MSSR uses adaptive intervals and prioritization, reducing waste on stable samples.
vs. Accuracy-based Replay: MSSR is proactive (predicts forgetting) rather than reactive, avoiding the cost of frequent validation.
vs. O-LoRA [not cited in paper]: O-LoRA learns orthogonal subspaces to avoid interference; MSSR focuses on data scheduling and can be complementary.

Limitations

Relies on the assumption that LLM forgetting dynamics mirror human memory (Ebbinghaus), which is a heuristic.
Tracking per-sample memory states adds computational overhead, although this is mitigated by a piecewise-constant hazard approximation.
Effectiveness depends on the accuracy of the memory strength estimation (hyperparameters alpha, gamma).

Reproducibility

Code: https://github.com/YiyangLu/MSSR

Hyperparameters such as learning rate and batch size are reported to be in Appendix E, but specific values do not appear in the main text reviewed here. The implementation is built on LLaMA-Factory.

📊 Experiments & Results

Evaluation Setup

Sequential fine-tuning on multiple datasets, evaluating on held-out test sets of all seen tasks after each stage.

Benchmarks:

GSM8K-RFT (Elementary Math Reasoning)
Competition Math (Advanced Math Reasoning)
MMLU (General Knowledge)
Alpaca-GPT4 (Instruction Following)
SQuAD (Reading Comprehension)

Metrics:

Average Forgetting (F)
Exact Match Accuracy
Token-level F1 (for SQuAD)
Average Normalized Score
Statistical methodology: Not explicitly reported in the paper
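
Two of the metrics listed above can be sketched in a few lines. The forgetting function follows the common continual-learning definition (best earlier accuracy minus final accuracy, averaged over all but the last task), which may not match the paper's exact formula. The F1 function only lowercases and splits on whitespace, whereas the official SQuAD script also strips punctuation and articles.

```python
from collections import Counter
from typing import List

def average_forgetting(acc: List[List[float]]) -> float:
    """acc[t][i] is the accuracy on task i after training stage t (i <= t)."""
    n_stages = len(acc)
    if n_stages < 2:
        return 0.0
    drops = []
    for i in range(n_stages - 1):
        best_earlier = max(acc[t][i] for t in range(i, n_stages - 1))
        drops.append(best_earlier - acc[n_stages - 1][i])
    return sum(drops) / len(drops)

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Made-up accuracies for a three-stage run, for illustration only.
acc = [[0.8], [0.7, 0.6], [0.5, 0.55, 0.9]]
forgetting = average_forgetting(acc)
```

In the made-up example, task 0 drops from 0.8 to 0.5 and task 1 from 0.6 to 0.55, giving an average forgetting of about 0.175.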

Key Results

The paper provides general statements about MSSR outperforming baselines in Tables 1 and 2 but does not provide extractable numeric values in the text for the baselines or the proposed method.

Experiment Figures

Conceptual comparison of replay strategies: Fixed (uniform), Reactive (spike-driven), and MSSR (expanding intervals).

Main Takeaways

MSSR_full achieves the best performance on the majority of datasets and backbones (Qwen, Gemma, Llama), indicating a synergy between sample-level and dataset-level scheduling.
Sample-level prioritization (MSSR_spl) is generally more effective than just scheduling (MSSR_sch), but comes with higher compute cost.
Accuracy-based replay is competitive but prohibitively expensive due to frequent validation; MSSR matches or beats it with much lower overhead.
MSSR shows strong gains on reasoning-intensive benchmarks (GSM8K, MATH), suggesting complex skills benefit significantly from adaptive spacing.

📚 Prerequisite Knowledge

Prerequisites

Continual Learning / Lifelong Learning
Experience Replay / Rehearsal
Parameter-Efficient Fine-Tuning (LoRA)
Ebbinghaus Forgetting Curve

Key Terms

Experience Replay: A technique where a subset of old data is mixed with new data during training to prevent the model from forgetting past tasks.

Catastrophic Forgetting: The phenomenon where a neural network abruptly loses previously learned information upon learning new information.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.

Ebbinghaus Forgetting Curve: A hypothesis that memory retention declines over time unless the information is reviewed, with the rate of decay decreasing after each review.

EMA: Exponential Moving Average—used here to denoise loss values for stable memory strength estimation.
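
As a small illustration of EMA smoothing applied to a noisy loss sequence, the sketch below uses the standard recurrence; the smoothing factor `beta` and the choice to initialize with the first value are assumptions rather than settings from the paper.

```python
from typing import List, Sequence

def ema(values: Sequence[float], beta: float = 0.9) -> List[float]:
    """Exponential moving average: avg = beta * avg + (1 - beta) * value."""
    smoothed = []
    avg = None
    for v in values:
        # Seed with the first observation, then blend in each new value.
        avg = v if avg is None else beta * avg + (1.0 - beta) * v
        smoothed.append(avg)
    return smoothed

smoothed = ema([1.0, 3.0], beta=0.5)
```

A larger `beta` weights history more heavily, so single-step loss spikes move the smoothed value less.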