Evaluation Setup
The evaluation covers Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL) across four task types: text classification, intent classification, relation extraction, and named entity recognition (NER).
Benchmarks:
- CLINC150 (Intent Classification)
- FewRel (Relation Extraction)
- Topic3 (Text Classification)
- Few-NERD (Named Entity Recognition)
Metrics:
- Average Accuracy (Avg. Acc)
- Forgetting Rate
- Statistical methodology: Not explicitly reported in the paper
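The summary names the two metrics but does not define them. As a rough illustration, the sketch below uses definitions that are common in the continual-learning literature, which are an assumption here and may differ from the paper's exact formulas: average accuracy is the mean accuracy over all tasks after the final task, and forgetting is the mean drop from each earlier task's best accuracy to its final accuracy. The function names and the `acc` matrix are hypothetical, and the numbers are made up for the example.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after the final task has been learned.

    acc[i][j] is the accuracy on task j measured after training on task i,
    so row i has i + 1 entries (ASSUMED layout, not taken from the paper).
    """
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def forgetting_rate(acc: List[List[float]]) -> float:
    """Mean drop from each earlier task's best accuracy to its final accuracy.

    ASSUMED definition (a common one); the last task is excluded because
    it has not yet had a chance to be forgotten.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy ever reached on task j before the final step.
        best = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)


# Made-up accuracies for three sequential tasks (illustration only).
acc = [
    [90.0],
    [80.0, 92.0],
    [70.0, 85.0, 94.0],
]
print(average_accuracy(acc))  # 83.0
print(forgetting_rate(acc))   # 13.5
```

A low forgetting value with a low average accuracy would indicate a model that never learned the tasks well in the first place, so the two numbers are best read together.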
Key Results
Probing results demonstrating that the backbone retains knowledge even when the model appears to forget.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CLINC150 (CIL) | Accuracy | 10.0 | 95.0 | +85.0 |

Main comparison of SEQ* against SOTA methods on Class-Incremental Learning (CIL).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CLINC150 (CIL) | Avg. Acc | 90.0 | 93.0 | +3.0 |
| CLINC150 (CIL) | Avg. Acc | 15.0 | 93.0 | +78.0 |
| FewRel (CIL) | Avg. Acc | 10.0 | 78.0 | +68.0 |
Main Takeaways
- Catastrophic forgetting in PLMs is largely a 'classifier forgetting' problem; the backbone features remain robust.
- Linear probing is the most effective metric for measuring inherent knowledge retention in PLMs.
- Pre-training creates a feature space that is 'orthogonal' to learned word embeddings, which aids in anti-forgetting.
- SEQ* (freezing backbone + old classifiers) is a frustratingly simple yet SOTA-competitive baseline.
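To make the last takeaway concrete, here is a framework-free toy sketch of the two ingredients named above: a backbone that is never updated, and per-class classifier heads that are frozen once their task is finished. Everything in it is an assumption for illustration: the class names `LinearHead` and `SeqStarSketch`, the identity function standing in for a pre-trained language model, and the perceptron update standing in for whatever training procedure the paper uses. It is not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]


class LinearHead:
    """One linear scorer per class (hypothetical helper)."""

    def __init__(self, dim: int) -> None:
        self.w = [0.0] * dim
        self.b = 0.0

    def score(self, feat: Sequence[float]) -> float:
        return sum(wi * fi for wi, fi in zip(self.w, feat)) + self.b


class SeqStarSketch:
    """Toy sketch: frozen backbone, old heads frozen, only new heads trained."""

    def __init__(self, backbone: Callable[[Sequence[float]], Sequence[float]], dim: int) -> None:
        self.backbone = backbone  # stand-in for a frozen PLM; never updated
        self.dim = dim
        self.heads: Dict[str, LinearHead] = {}

    def learn_task(self, examples: List[Example], lr: float = 0.1, epochs: int = 20) -> None:
        # Heads are created only for classes not seen in earlier tasks.
        new_classes = sorted({y for _, y in examples if y not in self.heads})
        for c in new_classes:
            self.heads[c] = LinearHead(self.dim)
        for _ in range(epochs):
            for x, y in examples:
                feat = self.backbone(x)
                # Only the new heads are updated; old heads stay untouched.
                for c in new_classes:
                    head = self.heads[c]
                    target = 1.0 if c == y else -1.0
                    if target * head.score(feat) <= 0:  # perceptron rule (ASSUMED)
                        head.w = [w + lr * target * f for w, f in zip(head.w, feat)]
                        head.b += lr * target

    def predict(self, x: Sequence[float]) -> str:
        feat = self.backbone(x)
        return max(self.heads, key=lambda c: self.heads[c].score(feat))


# Two sequential toy tasks with 2-D "features" (illustration only).
model = SeqStarSketch(backbone=lambda x: x, dim=2)
model.learn_task([([1.0, 0.0], "a"), ([0.0, 1.0], "b")])
old_weights = {c: (list(h.w), h.b) for c, h in model.heads.items()}
model.learn_task([([-1.0, 0.0], "c"), ([0.0, -1.0], "d")])
unchanged = all((model.heads[c].w, model.heads[c].b) == wb for c, wb in old_weights.items())
print(unchanged)  # True: the heads for "a" and "b" were not modified by the second task
```

Because the old heads are never touched while later tasks are learned, any drop in accuracy on earlier classes in this toy setup can only come from competition with the new heads' scores, not from overwritten weights, which is the intuition behind calling the problem 'classifier forgetting'.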