Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models

📝 Paper Summary

Incremental Learning (IL), Catastrophic Forgetting

The paper demonstrates that Pre-trained Language Models largely retain old knowledge during sequential fine-tuning. The observed performance drop comes from drift in the classifier head's class embeddings rather than from the backbone, and simple freezing strategies (SEQ*) are enough to fix it.

Core Problem

Current research assumes Pre-trained Language Models (PLMs) inherently suffer from catastrophic forgetting during Incremental Learning (IL), leading to complex methods that may underestimate the model's native abilities.

Why it matters:

Most existing IL methods are designed based on the false premise that PLMs forget old knowledge, leading to unnecessary complexity.
Simple baselines like Sequential Fine-tuning (SEQ) are widely regarded as weak lower bounds, but their failure modes are misunderstood.
Understanding where forgetting actually occurs (backbone vs. classifier) is crucial for efficient continual learning systems.

Concrete Example: In a Class-Incremental Learning setting on intent classification, a standard fine-tuned model's accuracy drops from ~98% to ~10% as new tasks are added. However, probing shows the backbone still retains the information to classify all tasks correctly; only the linear classifier head has 'forgotten' how to map features to old classes.

Key Novelty

SEQ* (Improved Sequential Fine-tuning)

Re-evaluates forgetting using linear probing, revealing that PLM backbones retain knowledge even after sequential fine-tuning on new tasks.
Identifies that 'forgetting' is primarily a classifier alignment issue: old class embeddings are pushed away from optimal positions while new ones dominate.
Proposes SEQ*: a simple method that freezes the PLM backbone after an initial warm-up and freezes old classifier heads, outperforming complex SOTA methods.
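
The gap between observed and probing performance described above can be illustrated with a toy sketch. Everything here is hypothetical: the 2-D "frozen features", the drifted head weights, and the helper names are invented for illustration, and a simple perceptron stands in for the gradient-trained linear probe used in the paper. The probe is also scored on the data it was fit on, which is a simplification.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (frozen feature vector, class label)

def predict(weights: List[List[float]], feature: Sequence[float]) -> int:
    """Return the class whose weight row gives the largest dot product."""
    logits = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    return max(range(len(logits)), key=lambda c: logits[c])

def accuracy(weights: List[List[float]], data: List[Sample]) -> float:
    return sum(predict(weights, f) == y for f, y in data) / len(data)

def train_probe(data: List[Sample], num_classes: int, epochs: int = 10) -> List[List[float]]:
    """Fit a fresh linear head on frozen features (perceptron stand-in)."""
    dim = len(data[0][0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for feature, label in data:
            guess = predict(weights, feature)
            if guess != label:  # update only on mistakes
                for i, x in enumerate(feature):
                    weights[label][i] += x
                    weights[guess][i] -= x
    return weights

# Hypothetical frozen backbone features: the two classes are linearly separable.
features: List[Sample] = [
    ((2.0, 0.5), 0), ((1.5, 1.0), 0), ((2.5, 0.2), 0),
    ((0.5, 2.0), 1), ((1.0, 1.5), 1), ((0.2, 2.5), 1),
]

# Hypothetical drifted head: the newer class (1) has a much larger embedding.
drifted_head = [[0.1, 0.1], [1.0, 1.0]]

observed_accuracy = accuracy(drifted_head, features)
probing_accuracy = accuracy(train_probe(features, num_classes=2), features)
print(observed_accuracy, probing_accuracy)
```

In this toy setup the drifted head assigns every input to the newer class, while a re-fitted head on the same unchanged features separates both classes. That is the pattern the paper attributes to classifier drift rather than to a forgetful backbone.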

Architecture

Illustration of the Probing Assessment framework vs. Standard Observation.

Evaluation Highlights

On Class-Incremental Learning (CIL) for intent classification (CLINC150), SEQ* achieves ~93% accuracy with BERT-Large, outperforming the SOTA method ELF (~90%) and standard SEQ (~15%).
Linear probing reveals PLM backbones maintain high accuracy (near 100% on some tasks) throughout sequential training, contradicting the catastrophic forgetting hypothesis.
SEQ* reduces trainable parameters significantly compared to expansion-based methods while matching or beating replay-based methods without storing old data.

Breakthrough Assessment

7/10

Challenges fundamental assumptions about catastrophic forgetting in PLMs. While the proposed method (SEQ*) is simple, the insight that 'forgetting is in the head, not the body' is a significant conceptual correction for the field.

⚙️ Technical Details

Problem Definition

Setting: Incremental Learning (IL) on a sequence of tasks D = {D_1, ..., D_T}, specifically Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL).

Inputs: Input samples x from the current task D_t.

Outputs: Predicted label y from the cumulative label set of all seen tasks (CIL) or the current task's label set (TIL).
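
The difference between the two output spaces can be sketched as follows. The label names, scores, and the `predict` helper are hypothetical; the sketch assumes the task ID is available at inference time for TIL, as stated in the Key Terms section.

```python
from typing import Dict, List, Optional

def predict(logits: Dict[str, float],
            task_labels: List[List[str]],
            task_id: Optional[int] = None) -> str:
    """Pick the highest-scoring label.

    CIL (task_id is None): candidates are all labels seen so far.
    TIL (task_id given): candidates are only that task's labels.
    """
    if task_id is None:
        candidates = [label for labels in task_labels for label in labels]
    else:
        candidates = task_labels[task_id]
    return max(candidates, key=lambda label: logits[label])

# Hypothetical label sets for two tasks and one input's logits.
task_labels = [["book_flight", "cancel_flight"],
               ["check_balance", "transfer_money"]]
logits = {"book_flight": 1.2, "cancel_flight": 0.3,
          "check_balance": 2.5, "transfer_money": 0.1}

cil_prediction = predict(logits, task_labels)             # all seen labels compete
til_prediction = predict(logits, task_labels, task_id=0)  # only task 0 labels compete
print(cil_prediction, til_prediction)
```

The same logits give different answers in the two settings: in CIL a high score on a newer class can override the correct older class, whereas TIL removes that competition by construction.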

Pipeline Flow

Task 1: Warm-up (Full Fine-tuning)
Task t > 1: Freeze Backbone & Old Classifiers, Train New Classifier Heads
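
The two-stage flow above amounts to a schedule of which parameter groups receive gradients at each task. The sketch below is one minimal way to express it; the function name and the dictionary layout are assumptions, and it treats all of Task 1 as the warm-up, as the flow above does.

```python
from typing import Dict, List, Union

def seq_star_trainable(task: int) -> Dict[str, Union[bool, List[bool]]]:
    """Return which parameter groups are trainable while learning `task` (1-based)."""
    if task < 1:
        raise ValueError("task index is 1-based")
    return {
        # The backbone is updated only during the Task 1 warm-up.
        "backbone": task == 1,
        # One flag per classifier head; only the newest head is trainable.
        "heads": [head == task for head in range(1, task + 1)],
    }

schedule = [seq_star_trainable(t) for t in (1, 2, 3)]
for t, flags in enumerate(schedule, start=1):
    print(t, flags)
```

In a deep learning framework, the flags would typically be applied by toggling the gradient setting on each parameter group before building the optimizer for the new task.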

System Modules

Backbone PLM

Extract semantic features from input text.

Model or implementation: BERT-base/large (Encoder) or GPT-2/Pythia (Decoder)

Classifier Head

Map features to class logits.

Model or implementation: Linear Layer or Cosine Linear Layer
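
A rough sketch of how the two head types score a feature vector is given below. The vectors and helper names are invented for illustration, bias terms are omitted, and the cosine version assumes non-zero vectors and no learned scale factor.

```python
import math
from typing import List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def linear_logits(feature: Sequence[float],
                  class_embeddings: List[Sequence[float]]) -> List[float]:
    """Plain dot products: sensitive to the norm of each class embedding."""
    return [dot(row, feature) for row in class_embeddings]

def cosine_logits(feature: Sequence[float],
                  class_embeddings: List[Sequence[float]]) -> List[float]:
    """Cosine similarities: only the direction of each class embedding matters."""
    f_norm = math.sqrt(dot(feature, feature))
    return [dot(row, feature) / (math.sqrt(dot(row, row)) * f_norm)
            for row in class_embeddings]

def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])

feature = [3.0, 4.0]
class_embeddings = [
    [0.6, 0.8],  # hypothetical old class: well aligned with the feature, small norm
    [5.0, 0.0],  # hypothetical new class: poorly aligned, large norm
]

linear_choice = argmax(linear_logits(feature, class_embeddings))
cosine_choice = argmax(cosine_logits(feature, class_embeddings))
print(linear_choice, cosine_choice)
```

With these made-up numbers the linear head favours the large-norm class while the cosine head favours the better-aligned one, which shows why the choice of head interacts with the embedding-norm effects discussed in the paper.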

Novel Architectural Elements

Freezing strategy: Unlike standard SEQ which fine-tunes everything, SEQ* explicitly freezes the backbone after the first task and freezes old classifier weights to prevent embedding displacement.

Modeling

Base Model: BERT-base-cased, BERT-large-cased, GPT-2, Pythia (70M to 1.4B)

Training Method: Sequential Fine-tuning (SEQ) and SEQ* (Freezing)

Objective Functions:

Purpose: Classification.

Formally: Cross-entropy loss over the current task's data.
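
Written out, the standard cross-entropy objective takes the form below. The notation is an assumption rather than the paper's own: z_c(x) denotes the logit for class c, and \mathcal{C} denotes whichever label set the softmax ranges over, which this summary does not specify.

```latex
\mathcal{L}_t \;=\; -\frac{1}{|D_t|} \sum_{(x,\,y) \in D_t}
\log \frac{\exp\big(z_y(x)\big)}{\sum_{c \in \mathcal{C}} \exp\big(z_c(x)\big)}
```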

Adaptation: SEQ* freezes backbone after Task 1; freezes old heads.

Trainable Parameters: Only the classifier head for the current task (after Task 1)

Training Data:

Datasets: Topic3 (Text Classification), CLINC150/Banking77 (Intent Classification), FewRel/TACRED (Relation Extraction), OntoNotes5/I2B2/Few-NERD (NER)

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
warmup_epochs: 1-3 epochs (Task 1)

Compute: Requires significantly less training time and fewer trainable parameters than SOTA IL methods (SEQ* freezes most parameters).

Comparison to Prior Work

vs. EWC/LwF: SEQ* does not use regularization terms, relying instead on freezing to preserve representations.
vs. DER++/ELF: SEQ* does not require a replay buffer (storing old data), though it can be combined with one.
vs. Standard SEQ: SEQ* freezes the backbone and old heads, whereas SEQ updates all parameters, which leads to classifier drift.

Limitations

SEQ* relies heavily on the assumption that the first task provides a sufficiently general feature space (warm-up is critical).
Does not continuously update the backbone, potentially limiting adaptation if downstream tasks are radically different from the pre-training/first-task distribution.
Requires knowing task boundaries (to freeze/unfreeze specific heads), which is standard in CIL/TIL but a limitation for task-free settings.

Reproducibility

Code: https://github.com/zzz47zzz/codebase-for-incremental-learning-with-llm

The code is publicly available at the repository above. Specific hyperparameters (learning rate, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL) on text classification, intent classification, relation extraction, and NER.

Benchmarks:

CLINC150 (Intent Classification)
FewRel (Relation Extraction)
Topic3 (Text Classification)
Few-NERD (Named Entity Recognition)

Metrics:

Average Accuracy (Avg. Acc)
Forgetting Rate
Statistical methodology: Not explicitly reported in the paper
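
The two metrics listed above can be computed from a table of per-task accuracies. The sketch below uses one common set of definitions from the continual learning literature, which is an assumption since the summary does not give the paper's exact formulas: average accuracy is the mean over tasks after the final training stage, and forgetting is the mean drop from each old task's best earlier accuracy to its final accuracy. The numbers are made up.

```python
from typing import List

def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after training on the last task."""
    final = acc[-1]
    return sum(final) / len(final)

def forgetting_rate(acc: List[List[float]]) -> float:
    """Mean drop from each old task's best earlier accuracy to its final accuracy."""
    final = acc[-1]
    num_old = len(final) - 1
    if num_old == 0:
        return 0.0  # nothing can be forgotten after a single task
    drops = [
        max(acc[i][j] for i in range(j, num_old)) - final[j]
        for j in range(num_old)
    ]
    return sum(drops) / num_old

# acc[i][j]: accuracy on task j after training on task i (hypothetical values).
acc = [
    [0.98],
    [0.60, 0.97],
    [0.10, 0.50, 0.96],
]

avg_acc = average_accuracy(acc)
forgetting = forgetting_rate(acc)
print(round(avg_acc, 3), round(forgetting, 3))
```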

Key Results

Probing results demonstrating that the backbone retains knowledge even when the model appears to forget.

Benchmark	Metric	Baseline	This Paper	Δ
CLINC150 (CIL)	Accuracy	10.0	95.0	+85.0

Main comparison of SEQ* against SOTA methods on Class-Incremental Learning (CIL).

Benchmark	Metric	Baseline	This Paper	Δ
CLINC150 (CIL)	Avg. Acc	90.0 (ELF)	93.0	+3.0
CLINC150 (CIL)	Avg. Acc	15.0 (SEQ)	93.0	+78.0
FewRel (CIL)	Avg. Acc	10.0	78.0	+68.0

Experiment Figures

Comparison of Observed Performance vs. Probing Performance on CLINC150.

Analysis of why the classifier fails: Class Embedding Norms and Moving Distances.
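
The two quantities named in the figure above can be sketched as follows. The definitions are one plausible reading and not confirmed by this summary: the norm is the Euclidean length of a class embedding, and the moving distance is the Euclidean distance between a class embedding right after its task was learned and the same embedding after later tasks. The vectors are made up.

```python
import math
from typing import Sequence

def norm(vector: Sequence[float]) -> float:
    """Euclidean norm of a class embedding."""
    return math.sqrt(sum(x * x for x in vector))

def moving_distance(before: Sequence[float], after: Sequence[float]) -> float:
    """Euclidean distance a class embedding has moved between two checkpoints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))

# Hypothetical embedding of one old class at two points in training.
old_class_before = [0.6, 0.8]  # right after its own task
old_class_after = [0.3, 0.4]   # after training on later tasks

norm_before = norm(old_class_before)
norm_after = norm(old_class_after)
distance = moving_distance(old_class_before, old_class_after)
print(round(norm_before, 3), round(norm_after, 3), round(distance, 3))
```

Tracking these values per class across tasks is one way to see whether old class embeddings shrink or drift while new ones grow, which is the failure mode SEQ* avoids by freezing old heads.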

Main Takeaways

Catastrophic forgetting in PLMs is largely a 'classifier forgetting' problem; the backbone features remain robust.
Linear probing is the most effective metric for measuring inherent knowledge retention in PLMs.
Pre-training creates a feature space that is 'orthogonal' to learned word embeddings, which aids in anti-forgetting.
SEQ* (freezing backbone + old classifiers) is a frustratingly simple yet SOTA-competitive baseline.

📚 Prerequisite Knowledge

Prerequisites

Incremental Learning / Continual Learning
Transfer Learning with Pre-trained Language Models (BERT, GPT)
Linear Probing vs. Fine-tuning

Key Terms

Catastrophic Forgetting: The tendency of neural networks to drastically lose performance on previously learned tasks when trained on new ones.

SEQ: Sequential Fine-tuning—training a model on tasks one by one without any specific anti-forgetting mechanisms.

SEQ*: The proposed method: Sequential Fine-tuning with specific freezing strategies (freeze backbone after warm-up, freeze old classifiers) to prevent classifier drift.

CIL: Class-Incremental Learning—tasks have disjoint label sets, and the model must classify inputs without knowing which task they belong to.

TIL: Task-Incremental Learning—tasks may have overlapping label sets, but the task ID is provided during inference.

Linear Probing: Evaluating a model's representation quality by freezing the backbone and training a new linear classifier on top of it using data from all tasks.

Probing Performance: The upper bound performance achievable if the classifier did not forget, used to measure inherent knowledge retention in the backbone.