MSSR mitigates catastrophic forgetting in continual LLM fine-tuning by modeling sample-level retention dynamics and adaptively scheduling replay based on the Ebbinghaus forgetting curve.
Core Problem
Continual fine-tuning of LLMs causes catastrophic forgetting, and existing replay strategies are either heuristic (fixed intervals), reactive (wait for loss spikes), or computationally expensive (frequent evaluation).
Why it matters:
LLMs deployed in dynamic environments must acquire new knowledge without degrading previously learned skills (e.g., in healthcare or law).
Current methods inadequately model the temporal heterogeneity of forgetting, often assuming uniform replay needs across time.
Scalability is limited because monitoring overhead for accuracy-based replay becomes prohibitive in long training runs.
Concrete Example: In a sequence like Alpaca → GSM8K → Math, a model trained on Math might forget basic instruction following (Alpaca). Fixed replay wastes compute on stable samples, while accuracy-based replay only triggers after performance has already dropped significantly.
Key Novelty
Memory-Inspired Sampler and Scheduler Replay (MSSR)
Models each data sample's 'memory strength' as a value that decays over time and is restored by replay, inspired by the Ebbinghaus forgetting curve.
Replay intervals expand over time (spacing effect): replay is frequent initially when forgetting is rapid, and becomes sparser as memory stabilizes.
Prioritizes replay for samples with lower memory strength (higher forgetting risk) rather than random selection.
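The decay and spacing behaviour described above can be illustrated with a small sketch. It assumes the classic exponential form of the Ebbinghaus curve, R(t) = exp(-t / S), and a multiplicative stability boost after each replay. Neither is confirmed as the paper's exact formulation, and the names `retention`, `next_replay_gap`, `replay_gaps`, `growth`, and `threshold` are hypothetical.

```python
import math
from typing import List


def retention(elapsed: float, stability: float) -> float:
    """Assumed Ebbinghaus-style retention: R(t) = exp(-t / S)."""
    return math.exp(-elapsed / stability)


def next_replay_gap(stability: float, threshold: float = 0.8) -> float:
    """Time until retention decays to `threshold`.

    Solves exp(-t / S) = threshold for t, giving t = -S * ln(threshold).
    """
    return -stability * math.log(threshold)


def replay_gaps(initial_stability: float, growth: float, n_replays: int,
                threshold: float = 0.8) -> List[float]:
    """Gaps between successive replays of one sample.

    Assumption: each replay multiplies the sample's stability by `growth`,
    so with growth > 1 every gap is longer than the previous one.
    """
    gaps = []
    stability = initial_stability
    for _ in range(n_replays):
        gaps.append(next_replay_gap(stability, threshold))
        stability *= growth  # assumed effect of a replay on stability
    return gaps


if __name__ == "__main__":
    for gap in replay_gaps(initial_stability=100.0, growth=2.0, n_replays=4):
        print(f"replay after {gap:.1f} steps")
```

With `growth` greater than 1 the gaps grow geometrically, which is one simple way to obtain the expanding-interval pattern; a scheduler that raised stability by a different rule would produce a different spacing.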
Architecture
The MSSR framework workflow, illustrating the closed-loop interaction between memory tracking, replay scheduling, and fine-tuning.
Evaluation Highlights
Achieves the most consistently strong performance across three backbone models (Qwen2.5, Gemma2, Llama-3.1) on sequential reasoning tasks.
Outperforms fixed and accuracy-based replay baselines on the 11-task long-sequence benchmark while reducing computational overhead.
Effectively mitigates early-task forgetting in long sequences compared to reactive baselines.
Breakthrough Assessment
7/10
Offers a principled, theoretically grounded alternative to heuristic replay. While the core concept (Ebbinghaus) is established in psychology, applying it to sample-level scheduling for LLM continual learning is a solid methodological contribution.
⚙️ Technical Details
Problem Definition
Setting: Continual fine-tuning on a sequence of datasets {D1, D2, ..., DT} where the model must learn Dt while minimizing forgetting on previous Di.
Inputs: Sequence of task datasets Dt
Outputs: A single model Ft capable of performing all tasks seen so far.
Pipeline Flow
Sample Memory Tracking (updates memory strength based on loss)
Replay Scheduler (determines when and how much to replay)
LoRA-based Optimization (updates model on mixed data)
System Modules
Sample Memory Tracker (Memory Management)
Updates per-sample memory strength m_{i,t} and stability S_{i,t} based on observed loss and elapsed time.
Model or implementation: Analytical decay model (Eq. 4 in paper)
Replay Scheduler (Memory Management)
Calculates replay probability for each sample and determines the global replay ratio.
Model or implementation: Probabilistic sampler
Optimization Engine
Updates model parameters using joint loss from current task and replay data.
Model or implementation: LoRA (Low-Rank Adaptation)
Novel Architectural Elements
Integration of an analytical memory decay model (based on Ebbinghaus curve) directly into the data sampling loop of an LLM trainer.
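How a memory tracker and a probabilistic sampler might interact inside a data loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: memory strength is taken to be exp(-elapsed / S), forgetting risk is 1 minus strength, and a replay multiplies stability by a hypothetical `growth` factor. The loss-based part of the update described above is omitted for brevity, and all class and function names are invented for this example.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MemoryState:
    stability: float      # S_i: larger means slower decay
    last_seen_step: int   # step of first exposure or most recent replay


class SampleMemoryTracker:
    """Tracks per-sample memory strength m_i = exp(-elapsed / S_i) (assumed form)."""

    def __init__(self, initial_stability: float = 100.0, growth: float = 1.5):
        self.initial_stability = initial_stability
        self.growth = growth
        self.states: Dict[int, MemoryState] = {}

    def register(self, sample_id: int, step: int) -> None:
        """Start tracking a sample that was just trained on."""
        self.states[sample_id] = MemoryState(self.initial_stability, step)

    def strength(self, sample_id: int, step: int) -> float:
        """Current memory strength in (0, 1]; decays with elapsed steps."""
        state = self.states[sample_id]
        return math.exp(-(step - state.last_seen_step) / state.stability)

    def mark_replayed(self, sample_id: int, step: int) -> None:
        """A replay resets the decay clock and raises stability."""
        state = self.states[sample_id]
        state.stability *= self.growth
        state.last_seen_step = step


def sample_replay(tracker: SampleMemoryTracker, step: int, k: int,
                  rng: random.Random) -> List[int]:
    """Draw k past samples (with replacement), weighted by forgetting risk."""
    ids = list(tracker.states)
    # Small epsilon keeps every weight positive even for freshly seen samples.
    weights = [1.0 - tracker.strength(i, step) + 1e-8 for i in ids]
    chosen = rng.choices(ids, weights=weights, k=k)
    for sample_id in set(chosen):
        tracker.mark_replayed(sample_id, step)
    return chosen
```

In a trainer, the ids returned by `sample_replay` would be mixed with the current task's batch before the optimizer step; how many to draw per step (the global replay ratio) is a separate scheduling decision that this sketch leaves to the caller.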
Modeling
Base Model: Qwen2.5-7B (primary), Gemma2-9B, Llama-3.1-8B, Mistral-7B-v0.3
Training Method: Continual Fine-Tuning with Experience Replay
Objective Functions:
Purpose: Jointly optimize performance on current task and retention of past tasks.
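For orientation, a generic experience-replay objective of this kind can be written as below. This is the textbook form, not necessarily the paper's exact formula: the replay set \mathcal{M}_t (drawn from D_1, ..., D_{t-1}), the per-example loss \ell, and the weight \lambda are notational assumptions introduced here.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{(x,y)\sim D_t}\big[\ell(f_\theta(x), y)\big]
  + \lambda\,\mathbb{E}_{(x,y)\sim \mathcal{M}_t}\big[\ell(f_\theta(x), y)\big]
```

When replay samples are simply mixed into each batch, \lambda is implicitly set by the fraction of the batch that comes from replay.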
Code is available at https://github.com/YiyangLu/MSSR and is built on LLaMA-Factory. Hyperparameters such as learning rate and batch size are reported in Appendix E of the paper; specific values are not given in the main text.
📊 Experiments & Results
Evaluation Setup
Sequential fine-tuning on multiple datasets, evaluating on held-out test sets of all seen tasks after each stage.
Benchmarks:
GSM8K-RFT (Elementary Math Reasoning)
Competition Math (Advanced Math Reasoning)
MMLU (General Knowledge)
Alpaca-GPT4 (Instruction Following)
SQuAD (Reading Comprehension)
Metrics:
Average Forgetting (F)
Exact Match Accuracy
Token-level F1 (for SQuAD)
Average Normalized Score
Statistical methodology: Not explicitly reported in the paper
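Two of the metrics listed above can be sketched in a few lines. The code assumes the definition of average forgetting commonly used in continual learning (best earlier score minus final score, averaged over all tasks but the last) and a simplified whitespace-tokenized F1; the paper's exact definitions may differ, and the official SQuAD scorer additionally strips punctuation and articles. Function names are hypothetical.

```python
from collections import Counter
from typing import List


def average_forgetting(acc: List[List[float]]) -> float:
    """acc[t][i] is the score on task i after training stage t (for i <= t).

    For each task i except the last, take the best score reached at any
    stage before the final one, subtract the final score, then average.
    """
    num_stages = len(acc)
    if num_stages < 2:
        return 0.0
    drops = []
    for i in range(num_stages - 1):
        best_earlier = max(acc[t][i] for t in range(i, num_stages - 1))
        drops.append(best_earlier - acc[num_stages - 1][i])
    return sum(drops) / len(drops)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (lowercased, whitespace split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A positive `average_forgetting` value means earlier tasks lost accuracy by the end of the sequence; a value near zero means they were retained.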
Key Results
No numeric results (benchmark, metric, baseline, this paper, Δ) can be extracted from the text: the paper states that MSSR outperforms the baselines in Tables 1 and 2, but the values for the baselines and the proposed method are not given in the text.
Experiment Figures
Conceptual comparison of replay strategies: Fixed (uniform), Reactive (spike-driven), and MSSR (expanding intervals).
Main Takeaways
MSSR_full achieves the best performance on the majority of datasets and backbones (Qwen, Gemma, Llama), indicating a synergy between sample-level and dataset-level scheduling.
Sample-level prioritization (MSSR_spl) is generally more effective than scheduling alone (MSSR_sch), but comes with a higher compute cost.
Accuracy-based replay is competitive but prohibitively expensive due to frequent validation; MSSR matches or beats it with much lower overhead.
MSSR shows strong gains on reasoning-intensive benchmarks (GSM8K, MATH), suggesting complex skills benefit significantly from adaptive spacing.
📚 Prerequisite Knowledge
Prerequisites
Continual Learning / Lifelong Learning
Experience Replay / Rehearsal
Parameter-Efficient Fine-Tuning (LoRA)
Ebbinghaus Forgetting Curve
Key Terms
Experience Replay: A technique where a subset of old data is mixed with new data during training to prevent the model from forgetting past tasks.
Catastrophic Forgetting: The phenomenon where a neural network abruptly loses previously learned information upon learning new information.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.
Ebbinghaus Forgetting Curve: A hypothesis that memory retention declines over time unless the information is reviewed, with the rate of decay decreasing after each review.
EMA: Exponential Moving Average—used here to denoise loss values for stable memory strength estimation.
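As a brief illustration of the EMA term, the sketch below smooths a noisy per-sample loss sequence. The smoothing factor `alpha` and the function names are hypothetical; the value used in the paper is not given here.

```python
from typing import List, Optional


def ema_update(previous: Optional[float], value: float, alpha: float = 0.1) -> float:
    """One EMA step: alpha * new value + (1 - alpha) * previous average."""
    if previous is None:
        return value  # first observation initializes the average
    return alpha * value + (1.0 - alpha) * previous


def smooth(values: List[float], alpha: float = 0.1) -> List[float]:
    """Apply ema_update across a sequence and return the running averages."""
    smoothed = []
    current = None
    for value in values:
        current = ema_update(current, value, alpha)
        smoothed.append(current)
    return smoothed
```

A smaller `alpha` damps single-step loss spikes more strongly, at the cost of reacting more slowly to a genuine rise in loss.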