Evaluation Setup
Pre-train models on vanilla vs. perturbed data, then fine-tune on downstream tasks.
Benchmarks:
- LAMA (Knowledge Probing)
- GLUE (Language Understanding)
- CoNLL03 / OntoNotes (Named Entity Recognition)
- ACE04 / ACE05 (Relation Extraction)
- Natural Questions / CosmosQA / FEVER (Knowledge Application: QA / Fact Checking)
Metrics:
- P@1 (LAMA)
- F1 score (NER/RE)
- Accuracy (GLUE/QA)
- Statistical methodology: t-test on the performance differences between models, with a significance threshold of p < 0.05
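The significance check in the last item can be sketched as follows. This is a minimal illustration, not the paper's procedure: the summary does not say which t-test variant was used, so Welch's unequal-variance test is an assumption, the function names `t_two_sided_p` and `welch_t_test` are hypothetical, and the per-seed scores are made-up placeholders. The p-value is obtained by numerically integrating the Student's t density so that only the standard library is needed.

```python
import math
from statistics import mean, variance


def t_two_sided_p(t_stat: float, df: float, steps: int = 2000) -> float:
    """Two-sided p-value for Student's t, via Simpson's rule on the pdf."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x: float) -> float:
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3  # integral of the pdf over [0, |t|]
    return max(0.0, 1.0 - 2.0 * area)


def welch_t_test(a, b):
    """Welch's unequal-variance t-test; returns (t statistic, two-sided p)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, t_two_sided_p(t_stat, df)


# Hypothetical per-seed scores, NOT taken from the paper.
vanilla_runs = [80.0, 80.6, 80.3, 80.5, 80.1]
perturbed_runs = [80.2, 79.9, 80.4, 80.0, 80.3]

t_stat, p_value = welch_t_test(vanilla_runs, perturbed_runs)
significant = p_value < 0.05  # threshold from the statistical methodology
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant = {significant}")
```

If the two models were fine-tuned with matched seeds, a paired t-test on the per-seed differences would be the more natural choice; the unpaired form is used here only because the pairing is not stated.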
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Knowledge Probing (LAMA) confirms that perturbation successfully destroys the model's factual knowledge. | | | | |
| LAMA | P@1 | 28.18 | 11.62 | -16.56 |
| Downstream tasks show negligible difference between models trained on correct vs. incorrect knowledge. | | | | |
| GLUE (Avg) | Score | 80.26 | 80.07 | -0.19 |
| CoNLL2003 | F1 | 91.37 | 91.22 | -0.15 |
| ACE2005 | F1 | 72.93 | 73.12 | +0.19 |
| Natural Questions | Exact Match | 50.36 | 50.38 | +0.02 |
| FewRel | Accuracy | 88.41 | 86.88 | -1.53 |
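The Δ column can be recomputed from the two score columns as a quick consistency check. The sketch below copies the reported values from the table; reading the "Baseline" column as the vanilla model and the "This Paper" column as the perturbed model is an interpretation of the setup, and the names `RESULTS` and `delta` are illustrative.

```python
# (benchmark, metric, baseline score, perturbed score), copied from the table.
# Treating the second score as the perturbed-data model is an assumption.
RESULTS = [
    ("LAMA", "P@1", 28.18, 11.62),
    ("GLUE (Avg)", "Score", 80.26, 80.07),
    ("CoNLL2003", "F1", 91.37, 91.22),
    ("ACE2005", "F1", 72.93, 73.12),
    ("Natural Questions", "Exact Match", 50.36, 50.38),
    ("FewRel", "Accuracy", 88.41, 86.88),
]


def delta(baseline: float, perturbed: float) -> float:
    """Absolute change in points, rounded to the table's two decimals."""
    return round(perturbed - baseline, 2)


deltas = {name: delta(base, pert) for name, _, base, pert in RESULTS}

for name, metric, base, pert in RESULTS:
    relative = 100.0 * (pert - base) / base  # change relative to the baseline
    print(f"{name:18s} {metric:12s} {deltas[name]:+7.2f} pts ({relative:+6.1f}%)")
```

Printing the relative change next to the absolute one makes the contrast visible: the LAMA score loses more than half of its baseline value, while every downstream change stays within a couple of percent of its baseline.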
Main Takeaways
- Correctness of injected factual knowledge has very limited effect on downstream task performance across NLU, NER, RE, and QA tasks.
- Performance fluctuations caused by random seeds (e.g., 0.33% on GLUE) were often larger than fluctuations caused by injecting wrong knowledge (0.19%).
- Even Ontological Substitution (changing 'Person' to 'Location') did not significantly degrade performance on Entity Typing or NER tasks, suggesting models rely on local context or superficial cues rather than deep ontological knowledge.
- Previous claims that 'factual knowledge injection' drives performance gains are likely conflating 'knowledge' with other factors like domain adaptation or regularization.
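The seed-noise argument in the second takeaway amounts to comparing the size of the perturbation effect against the spread observed across random seeds. A minimal sketch of that comparison, using only the two GLUE figures quoted above; the helper name `within_seed_noise` is hypothetical, and treating the quoted seed fluctuation as a simple upper bound is a simplifying assumption.

```python
def within_seed_noise(delta: float, seed_fluctuation: float) -> bool:
    """True when a score change is no larger than the seed-to-seed fluctuation."""
    return abs(delta) <= seed_fluctuation


glue_seed_fluctuation = 0.33     # fluctuation across random seeds on GLUE
glue_perturbation_delta = -0.19  # change from injecting wrong knowledge

print(within_seed_noise(glue_perturbation_delta, glue_seed_fluctuation))  # True
```

A check like this only says the effect is indistinguishable from seed variation at this granularity; the t-test described in the evaluation setup remains the formal significance criterion.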