Evaluation Setup
Models are fine-tuned sequentially on multiple datasets; after each stage, they are evaluated on the held-out test sets of all tasks seen so far.
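The protocol can be sketched as a loop that fills a lower-triangular score matrix. This is a minimal illustration, not the paper's code: `train_fn` and `eval_fn` are hypothetical callables standing in for the fine-tuning step and the per-task scoring step.

```python
from typing import Callable, List, Optional


def run_sequential_eval(
    tasks: List[str],
    train_fn: Callable[[str], None],
    eval_fn: Callable[[str], float],
) -> List[List[Optional[float]]]:
    """Return scores[i][j]: score on task j after fine-tuning stage i.

    Entries with j > i stay None because task j has not been seen yet.
    """
    n = len(tasks)
    scores: List[List[Optional[float]]] = [[None] * n for _ in range(n)]
    for i, task in enumerate(tasks):
        # Hypothetical hook: fine-tune the current model on this stage's dataset.
        train_fn(task)
        # Evaluate on the held-out test set of every task seen so far.
        for j in range(i + 1):
            scores[i][j] = eval_fn(tasks[j])
    return scores
```

Row `i` of the returned matrix holds the results after stage `i`, so the last row gives the final performance on every task.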
Benchmarks:
- GSM8K-RFT (Elementary Math Reasoning)
- Competition Math (Advanced Math Reasoning)
- MMLU (General Knowledge)
- Alpaca-GPT4 (Instruction Following)
- SQuAD (Reading Comprehension)
Metrics:
- Average Forgetting (F)
- Exact Match Accuracy
- Token-level F1 (for SQuAD)
- Average Normalized Score
- Statistical methodology: Not explicitly reported in the paper
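The first three metrics can be sketched as follows. These are assumptions rather than the paper's exact definitions: the text normalization follows the convention commonly used for SQuAD-style scoring (lowercase, strip punctuation and English articles), and Average Forgetting uses a common continual-learning definition (best earlier score on a task minus its final score, averaged over all tasks except the last). The `scores[i][j]` layout (score on task `j` after stage `i`) is hypothetical. Average Normalized Score is omitted because its definition is not given here.

```python
import re
import string
from collections import Counter
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_forgetting(scores: List[List[Optional[float]]]) -> float:
    """Mean drop from the best earlier score to the final score per task.

    Assumed layout: scores[i][j] is the score on task j after stage i,
    and is None when j > i (task not yet seen).
    """
    last = len(scores) - 1
    if last < 1:
        return 0.0  # a single stage cannot exhibit forgetting
    drops = []
    for j in range(last):  # the final task is excluded by this definition
        best_earlier = max(scores[i][j] for i in range(j, last))
        drops.append(best_earlier - scores[last][j])
    return sum(drops) / len(drops)
```

Under this definition a positive value means performance on earlier tasks degraded, and a negative value indicates backward transfer.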
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|

The paper provides general statements about MSSR outperforming baselines in Tables 1 and 2 but does not provide extractable numeric values in the text for the baselines or the proposed method.
Main Takeaways
- MSSR_full achieves the best performance on the majority of datasets and backbones (Qwen, Gemma, Llama), indicating a synergy between sample-level and dataset-level scheduling.
- Sample-level prioritization (MSSR_spl) is generally more effective than scheduling alone (MSSR_sch), but it incurs a higher compute cost.
- Accuracy-based replay is competitive but prohibitively expensive due to frequent validation; MSSR matches or beats it with much lower overhead.
- MSSR shows strong gains on reasoning-intensive benchmarks (GSM8K, MATH), suggesting complex skills benefit significantly from adaptive spacing.