Does the Correctness of Factual Knowledge Matter for Factual Knowledge-Enhanced Pre-trained Language Models?

📝 Paper Summary

Factual Knowledge Injection | Interpretability & Analysis | Knowledge Internalization

By pre-training models on massive factually perturbed data (replacing entities with incorrect ones), this study reveals that the correctness of injected knowledge has negligible causal impact on downstream task performance.

Core Problem

Previous studies observe a correlation between factual knowledge injection and improved downstream performance, but have not established causality; improvements might stem from confounding factors rather than the knowledge itself.

Why it matters:

If knowledge correctness doesn't matter, current research directions focusing on high-quality knowledge injection might be misguided
Performance gains attributed to 'knowledge' might actually come from data domain, model size, or linguistic exposure
Understanding what PLMs actually learn during knowledge injection is crucial for designing reliable AI systems

Concrete Example: A model trained on 'Bill Gates is the CEO of Apple' (perturbed) performs statistically identically to one trained on 'Tim Cook is the CEO of Apple' (correct) on Named Entity Recognition tasks, suggesting the specific fact didn't drive the capability.

Key Novelty

Counterfactual Knowledge Perturbation Analysis

Invert the standard evaluation: instead of trying to improve models, deliberately break the factual knowledge in the pre-training data at scale (up to 93% perturbation)
Train models from scratch on this 'wrong' data using standard injection methods (masked modeling, entity embedding, adapter supervision)
Compare downstream performance: if knowledge matters, the 'wrong' model should fail; if it performs equally well, the knowledge wasn't the cause of improvement

Architecture

The counterfactual analysis framework flow: from data perturbation to pre-training to evaluation

Evaluation Highlights

No statistically significant difference (p > 0.05) in performance between vanilla and factually perturbed BERT models on 6 out of 7 downstream benchmarks (e.g., GLUE, SQuAD, NER)
Perturbing 93% of Wikipedia facts caused LAMA knowledge probing accuracy to drop from 28.18% to 11.62%, yet downstream GLUE scores remained stable (80.26 vs 80.07)
Even ontological perturbation (swapping entities with wrong types, e.g., Person → Location) failed to significantly degrade performance on most tasks, including entity typing

Breakthrough Assessment

8/10

A highly significant negative result. It challenges a core assumption of the knowledge-injection subfield: the evidence suggests that gains attributed to injected 'knowledge' may come from confounding factors rather than from the correctness of the facts themselves.

⚙️ Technical Details

Problem Definition

Setting: Pre-training language models on text or knowledge graphs where factual statements <subject, relation, object> are systematically perturbed

Inputs: Corpus D or Knowledge Base K with perturbed facts

Outputs: Pre-trained Language Model parameters θ

Pipeline Flow

Data Perturbation (Factual or Ontological)
Knowledge Injection Pre-training (BERT / ERNIE / K-Adapter)
Downstream Fine-tuning & Evaluation

System Modules

Perturbation Engine

Generate counterfactual training data by swapping entities

Model or implementation: Rule-based substitution using Wikidata types
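
The two substitution modes can be illustrated with a minimal sketch. This is not the authors' code: the entity inventory, the type names, and the helpers `substitute` and `perturb_triple` are hypothetical, and replacing only the subject of a triple is an assumption made for brevity.

```python
import random

# Hypothetical toy inventory keyed by a Wikidata-like type (not from the paper).
ENTITIES_BY_TYPE = {
    "person": ["Tim Cook", "Bill Gates", "Satya Nadella"],
    "organization": ["Apple", "Microsoft"],
    "location": ["Paris", "Seattle"],
}
TYPE_OF = {name: t for t, names in ENTITIES_BY_TYPE.items() for name in names}


def substitute(entity, mode, rng):
    """Pick a replacement entity.

    'factual'     -> a different entity of the SAME type.
    'ontological' -> an entity of a DIFFERENT type.
    """
    own_type = TYPE_OF[entity]
    if mode == "factual":
        pool = [e for e in ENTITIES_BY_TYPE[own_type] if e != entity]
    elif mode == "ontological":
        pool = [e for t, names in ENTITIES_BY_TYPE.items()
                if t != own_type for e in names]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rng.choice(pool)


def perturb_triple(triple, mode, rng):
    """Replace the subject of a <subject, relation, object> triple (assumption)."""
    subject, relation, obj = triple
    return (substitute(subject, mode, rng), relation, obj)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fact = ("Tim Cook", "CEO of", "Apple")
factual = perturb_triple(fact, "factual", rng)
ontological = perturb_triple(fact, "ontological", rng)
print(factual, ontological)
```

In a real pipeline the same replacement would presumably be applied to entity mentions in running text as well as to knowledge-base triples; the sketch only covers the triple case.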

Knowledge Injector

Train language model on perturbed data

Model or implementation: BERT-base / ERNIE / K-Adapter

Downstream Evaluator

Fine-tune and test on downstream tasks

Model or implementation: Task-specific heads on PLM

Novel Architectural Elements

The novelty is in the experimental framework (counterfactual data generation pipeline) rather than the model architecture itself

Modeling

Base Model: BERT-base-uncased (110M parameters)

Training Method: Pre-training from scratch (BERT) or Continued Pre-training (ERNIE, K-Adapter)

Objective Functions:

Purpose: Learn contextual representations.
Formally: Masked Language Modeling (MLM)

Purpose: Inject structured knowledge (ERNIE).
Formally: TransE-based entity embedding alignment

Purpose: Inject structured knowledge (K-Adapter).
Formally: Relation Classification loss
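
The summary names a TransE-based objective without spelling it out. As a rough illustration of the general TransE idea only (a true triple should satisfy head + relation ≈ tail), here is a minimal sketch. The 2-d embeddings, the function names, and the margin value are all hypothetical; this is not ERNIE's implementation.

```python
import math


def transe_distance(head, relation, tail):
    """L2 distance ||h + r - t||; smaller means the triple looks more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(pos_distance, neg_distance, margin=1.0):
    """Hinge loss: push true triples at least `margin` closer than corrupted ones."""
    return max(0.0, margin + pos_distance - neg_distance)


# Hypothetical toy embeddings, chosen by hand for readability.
tim_cook = [1.0, 0.0]
ceo_of = [0.0, 1.0]
apple = [1.0, 1.0]
microsoft = [3.0, 1.0]

pos = transe_distance(tim_cook, ceo_of, apple)      # true triple
neg = transe_distance(tim_cook, ceo_of, microsoft)  # corrupted tail
loss = margin_loss(pos, neg)
print(pos, neg, loss)
```

Under perturbation, the "true" triples fed to such an objective are themselves wrong, which is exactly the condition the study tests.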

Training Data:

Source: Wikipedia (14,545,579 paragraphs)
Source: Wikidata entities/relations
Perturbation: Factual substitution (same type), Ontological substitution (cross type)

Key Hyperparameters:

batch_size: 1024
learning_rate: 1e-4
warmup_steps: 10000
total_steps: 500000
optimizer: Adam
adam_beta1: 0.9
adam_beta2: 0.999
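
To make the listed values concrete, the sketch below plugs them into a linear warmup followed by linear decay. The schedule shape is an assumption (it is the common BERT-style choice; the summary lists only the warmup step count), and `learning_rate` is a hypothetical helper.

```python
# Values taken from the hyperparameter list above.
BATCH_SIZE = 1024
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 500_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed schedule)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Total training sequences processed over the whole run.
sequences_seen = BATCH_SIZE * TOTAL_STEPS

for step in (0, 5_000, 10_000, 255_000, 500_000):
    print(step, learning_rate(step))
print("sequences seen:", sequences_seen)
```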

Compute: 2 NVIDIA A100 GPUs (80 GB memory) for about 10 days (for BERT training)

Comparison to Prior Work

vs. Standard Knowledge Injection papers: They focus on showing performance *gains*; this paper focuses on *why* gains happen (or don't) via ablation of truth
vs. Random Embeddings [not cited in paper]: Similar to papers showing random graph embeddings work for some tasks, this shows wrong knowledge works for NLP tasks

Limitations

Experiments limited to BERT-base scale; did not evaluate on extremely large models (e.g., GPT-3, LLaMA)
Focuses on factual knowledge; linguistic or commonsense knowledge impact not fully isolated
One dataset (FewRel) showed sensitivity to perturbation likely due to severe leakage between pre-training KB and test set

Reproducibility

Code: https://github.com/tangqiaoyu/KnowledgeDisturb

The code and perturbation logic are publicly available. The pre-training data (Wikipedia/Wikidata) is very large, so a full replication requires significant compute.

📊 Experiments & Results

Evaluation Setup

Pre-train models on vanilla vs. perturbed data, then fine-tune on downstream tasks.

Benchmarks:

LAMA (Knowledge Probing)
GLUE (Language Understanding)
CoNLL03 / OntoNotes (Named Entity Recognition)
ACE04 / ACE05 (Relation Extraction)
Natural Questions / CosmosQA / FEVER (Knowledge Applying: QA / Fact Checking)

Metrics:

P@1 (LAMA)
F1 score (NER/RE)
Accuracy (GLUE/QA)
Statistical methodology: t-test to examine significance of performance differences; threshold 0.05
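
The significance check can be sketched as follows. The summary says only "t-test"; Welch's unequal-variance variant is an assumption here, and the per-seed scores are made up for illustration and are not results from the paper. The p-value is obtained by numerically integrating the Student t density so the sketch needs only the standard library.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def welch_t_test(a, b, n_slices=20_000):
    """Return (t, df, two-sided p) for Welch's t-test; n_slices must be even."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Simpson's rule for the integral of the density from 0 to |t|.
    upper = abs(t)
    h = upper / n_slices
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, n_slices):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    p = 1 - 2 * (total * h / 3)
    return t, df, max(0.0, p)


# HYPOTHETICAL per-seed scores, for illustration only.
vanilla_runs = [80.1, 80.4, 80.3, 80.0, 80.5]
perturbed_runs = [80.0, 80.3, 80.1, 79.9, 80.2]

t, df, p = welch_t_test(vanilla_runs, perturbed_runs)
print(f"t={t:.3f}, df={df:.2f}, p={p:.3f}, significant={p < 0.05}")
```

With a 0.05 threshold, a difference counts as significant only when p falls below 0.05; in this toy data the gap between the two groups is small relative to the seed-to-seed spread, so it does not.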

Key Results

Knowledge probing (LAMA) confirms that perturbation destroys much of the model's factual knowledge, while downstream tasks show negligible differences between models trained on correct and incorrect knowledge. FewRel is the one exception (see Limitations).

Benchmark	Metric	Vanilla	Perturbed	Δ (Perturbed − Vanilla)
LAMA	P@1	28.18	11.62	-16.56
GLUE (Avg)	Score	80.26	80.07	-0.19
CoNLL2003	F1	91.37	91.22	-0.15
ACE2005	F1	72.93	73.12	+0.19
Natural Questions	Exact Match	50.36	50.38	+0.02
FewRel	Accuracy	88.41	86.88	-1.53
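
Each Δ in the table is the perturbed-model score minus the vanilla-model score. The short check below recomputes the deltas from the reported scores and adds the relative change, which makes the contrast between LAMA and the downstream tasks easier to see. The variable names are illustrative.

```python
# (benchmark, vanilla score, perturbed score, reported delta) from the table.
ROWS = [
    ("LAMA", 28.18, 11.62, -16.56),
    ("GLUE (Avg)", 80.26, 80.07, -0.19),
    ("CoNLL2003", 91.37, 91.22, -0.15),
    ("ACE2005", 72.93, 73.12, 0.19),
    ("Natural Questions", 50.36, 50.38, 0.02),
    ("FewRel", 88.41, 86.88, -1.53),
]

results = {}
for name, vanilla, perturbed, reported in ROWS:
    delta = round(perturbed - vanilla, 2)   # absolute change in points
    relative = 100.0 * delta / vanilla      # change relative to the vanilla score, in %
    results[name] = (delta, relative)
    # Guard against transcription errors in the table.
    assert abs(delta - reported) < 1e-9, name
    print(f"{name:18s} delta={delta:+.2f}  relative={relative:+.2f}%")
```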

Experiment Figures

Distributions of model performance on downstream tasks across random seeds vs. perturbations

Main Takeaways

Correctness of injected factual knowledge has very limited effect on downstream task performance across NLU, NER, RE, and QA tasks.
Performance fluctuations caused by random seeds (e.g., 0.33 points on GLUE) were often larger than fluctuations caused by injecting wrong knowledge (0.19 points).
Even Ontological Substitution (changing 'Person' to 'Location') did not significantly degrade performance on Entity Typing or NER tasks, suggesting models rely on local context or superficial cues rather than deep ontological knowledge.
Previous claims that 'factual knowledge injection' drives performance gains are likely conflating 'knowledge' with other factors like domain adaptation or regularization.

📚 Prerequisite Knowledge

Prerequisites

Pre-training paradigms (Masked Language Modeling)
Knowledge Injection methods (ERNIE, K-Adapter)
Factual probing benchmarks (LAMA)

Key Terms

Factual Substitution: Replacing an entity in a factual statement with another entity of the SAME type (e.g., 'Tim Cook' -> 'Bill Gates')

Ontological Substitution: Replacing an entity with one of a DIFFERENT type (e.g., 'Tim Cook' -> 'Microsoft'), disrupting semantic type constraints

LAMA: LAnguage Model Analysis—a benchmark using cloze-style queries (e.g., 'Paris is the capital of [MASK]') to probe factual knowledge

ERNIE: Enhanced Language Representation with Informative Entities—a method injecting knowledge by aggregating pre-trained entity embeddings with text tokens

K-Adapter: A method using neural adapters to inject knowledge via auxiliary tasks (like relation classification) while keeping the original model frozen

Counterfactual Analysis: Investigating causal effects by asking 'what if' questions—here, observing model behavior if the training data had been different (incorrect)
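
As an illustration of how a LAMA-style P@1 score is computed from cloze queries, here is a minimal sketch. The queries, the ranked predictions, and the helper `precision_at_1` are hypothetical; a real evaluation would take the ranked candidates from the language model's [MASK] predictions.

```python
def precision_at_1(ranked_predictions, gold_answers):
    """Percentage of queries whose top-ranked prediction equals the gold answer."""
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] == gold
    )
    return 100.0 * hits / len(gold_answers)


# Hypothetical cloze queries with gold answers.
queries = [
    ("Paris is the capital of [MASK].", "France"),
    ("Berlin is the capital of [MASK].", "Germany"),
    ("Tokyo is the capital of [MASK].", "Japan"),
    ("Beijing is the capital of [MASK].", "China"),
]
# Hypothetical model outputs: candidates ranked from most to least likely.
ranked = [
    ["France", "Italy"],
    ["Europe", "Germany"],  # gold answer ranked second, so it does not count
    ["Japan", "Asia"],
    ["Asia", "Japan"],
]

score = precision_at_1(ranked, [gold for _, gold in queries])
print(f"P@1 = {score:.2f}")
```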