Evaluation Setup
Pre-training from scratch, continued pre-training, and fine-tuning scenarios
Benchmarks:
- BLiMP (syntactic acceptability via minimal pairs)
- SyntaxGym (syntactic generalization via surprisal constraints)
- WikiText-103 (language modeling, measured by perplexity)
- HANS (adversarial NLI)
Metrics:
- Perplexity (PPL)
- Accuracy (BLiMP)
- SG Score (SyntaxGym)
- Accuracy (HANS/MultiNLI)
- Statistical methodology: Not explicitly reported in the paper
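The two headline metrics have simple definitions: perplexity is the exponential of the mean per-token negative log-likelihood, and BLiMP accuracy is the fraction of minimal pairs where the model assigns higher log-probability to the grammatical sentence. A minimal sketch of both (the function names and the illustrative numbers are our own, not from the paper):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

def blimp_accuracy(pairs):
    """Fraction of (grammatical, ungrammatical) sentence pairs where the
    model gives the grammatical sentence the higher log-probability."""
    correct = sum(1 for good_lp, bad_lp in pairs if good_lp > bad_lp)
    return correct / len(pairs)

# Illustrative values only.
ppl = perplexity([3.7, 3.8, 3.75])          # exp(3.75)
acc = blimp_accuracy([(-12.3, -15.1),       # grammatical wins
                      (-8.0, -7.5)])        # ungrammatical wins
```

SG Score (SyntaxGym) is computed differently, from surprisal inequalities over test-suite regions, and is not sketched here.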
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Question Formation (QF) | Accuracy | 42.5 | 100.0 | +57.5 |
| SyntaxGym | SG Score | 73.2 | 82.7 | +9.5 |
| WikiText-103 | Perplexity | 46.20 | 41.97 | -4.23 |
| HANS (Adversarial NLI) | Accuracy | 15.0 | 56.2 | +41.2 |
Main Takeaways
- TreeReg consistently improves syntactic generalization (BLiMP, SyntaxGym) across model scales and training regimes
- Regularizing for syntax improves out-of-distribution perplexity (WikiText-103), suggesting syntax is a robust feature for general language modeling
- The method is data-efficient: TreeReg LMs outperform standard LMs trained on 2x more data
- TreeReg mitigates catastrophic forgetting of syntax and reduces reliance on spurious heuristics (HANS) during fine-tuning