WKLM Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

📝 Paper Summary

Knowledge-Enhanced Language Models, Weakly Supervised Pretraining

WKLM (Weakly Supervised Knowledge-Pretrained Language Model) enhances BERT by training it to distinguish correct entity mentions from same-type negative replacements, improving factual knowledge retention.

Core Problem

Standard language model pretraining (like BERT) captures syntax and semantics but struggles to explicitly model entity-centric encyclopedic knowledge.

Why it matters:

Pretrained models often fail tasks requiring external world knowledge (e.g., specific facts about entities) despite training on large corpora.
Existing solutions often require complex external knowledge base integrations or memory-heavy architectures.
Zero-shot fact completion reveals that standard BERT encodes entity-level knowledge only to a limited degree.

Concrete Example: In a sentence like 'The capital of France is Paris', a standard LM might predict 'Paris' based on collocation. WKLM is explicitly trained to distinguish 'Paris' from other cities (e.g., 'London', 'Berlin') placed in that context, forcing it to learn the factual relationship rather than just linguistic patterns.

Key Novelty

Entity Replacement Training (Weakly Supervised)

Instead of just masking random tokens, the model identifies entity mentions and replaces them with other entities of the *same type* (e.g., replacing a person with another person).
The model must determine if an entity in the text is the original correct one or a replacement, effectively training it to fact-check statements.
This objective is combined with standard Masked Language Modeling (MLM) and requires neither external knowledge base components nor architecture changes during fine-tuning.
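The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `ENTITIES_BY_TYPE` inventory, the function name `replace_entities`, and the `replace_prob` parameter are all hypothetical, and the summary does not state how often mentions are replaced.

```python
import random

# Hypothetical toy inventory: entity type -> entities of that type.
ENTITIES_BY_TYPE = {
    "city": ["Paris", "London", "Berlin", "Madrid"],
    "person": ["Marie Curie", "Alan Turing", "Ada Lovelace"],
}

def replace_entities(tokens, mentions, replace_prob=0.5, rng=None):
    """Swap some mentions for a different entity of the same type.

    tokens   : list of str
    mentions : list of (start, end, name, entity_type); end is exclusive,
               spans are non-overlapping and sorted by start
    returns  : (new_tokens, labels) where labels holds
               (start, end, is_original) spans in the NEW token sequence
    """
    rng = rng or random.Random()
    out, labels, cursor = [], [], 0
    for start, end, name, etype in mentions:
        out.extend(tokens[cursor:start])  # copy text before the mention
        candidates = [e for e in ENTITIES_BY_TYPE.get(etype, []) if e != name]
        if candidates and rng.random() < replace_prob:
            surface, is_original = rng.choice(candidates).split(), False
        else:
            surface, is_original = tokens[start:end], True
        labels.append((len(out), len(out) + len(surface), is_original))
        out.extend(surface)
        cursor = end
    out.extend(tokens[cursor:])  # copy text after the last mention
    return out, labels

# Usage: the replaced city is drawn at random from the other cities.
sentence = "The capital of France is Paris".split()
new_tokens, labels = replace_entities(
    sentence, [(5, 6, "Paris", "city")], replace_prob=1.0, rng=random.Random(0)
)
```

The labels are re-indexed against the new token sequence because a replacement may have a different number of tokens than the original mention.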

Architecture

Illustration of the Type-Constrained Entity Replacement strategy.

Evaluation Highlights

Outperforms BERT-large on Zero-Shot Fact Completion (Hits@10) with significant gains (e.g., +24.8% on 'Capital Of' relation).
Achieves new state-of-the-art on FIGER fine-grained entity typing with 60.21% accuracy (+5.68 points over BERT-base).
Improves open-domain QA performance on WebQuestions, TriviaQA, and Quasar-T by an average of 2.7 F1 points over BERT.

Breakthrough Assessment

7/10

A simple yet highly effective pretraining objective that significantly improves knowledge grounding without architectural changes. It sets SOTA on entity typing and improves QA, though it relies on the existing BERT architecture.

⚙️ Technical Details

Problem Definition

Setting: Pretraining a language encoder to capture real-world entity knowledge from unstructured text.

Inputs: Text documents with recognized entity mentions (linked to Wikipedia/Wikidata).

Outputs: Binary prediction for each entity mention indicating whether it is the original correct entity or a replacement.

Pipeline Flow

Entity Recognition & Linking (Identify mentions in text)
Negative Sampling (Replace mentions with same-type entities)
Encoder Processing (BERT processes modified text)
Prediction (Binary classification on entity tokens + MLM prediction)

System Modules

Data Preprocessor

Identify entities using Wikipedia anchors/Wikidata aliases and create negative examples by replacing entities with others of the same type.

Model or implementation: String matching against Wikidata
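String matching against an alias table could look roughly like the sketch below. The greedy longest-match strategy, the function name `link_mentions`, the `max_len` cap, and the placeholder entity identifiers are assumptions for illustration; the summary only says that mentions are found via Wikipedia anchors and Wikidata aliases.

```python
def link_mentions(tokens, alias_to_entity, max_len=5):
    """Greedy longest-match of token n-grams against an alias table.

    Returns a list of (start, end, entity_id) with end exclusive.
    """
    mentions, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first so "New York City"
        # wins over "New York" or "York".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            alias = " ".join(tokens[i:i + n])
            if alias in alias_to_entity:
                match = (i, i + n, alias_to_entity[alias])
                break
        if match:
            mentions.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return mentions

# Usage with a hypothetical alias table (identifiers are placeholders).
aliases = {"New York": "ENT_NYC", "New York City": "ENT_NYC", "York": "ENT_YORK"}
print(link_mentions("She moved to New York City".split(), aliases))
# prints [(3, 6, 'ENT_NYC')]
```

Exact string matching of this kind is one source of the linking noise noted under Limitations, since an alias can be ambiguous between entities.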

Language Encoder

Encode the text context and entity mentions.

Model or implementation: BERT-base architecture (12 layers, 768 hidden dim)

Entity Discriminator head

Predict if an entity mention is original or replaced based on boundary word representations.

Model or implementation: Linear layer over concatenated boundary word representations
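One way to read "linear layer over concatenated boundary word representations" is sketched below in plain Python. The function name `entity_score`, the sigmoid output, and the toy vectors are assumptions for illustration; in a real setup the hidden vectors would come from the encoder and the weights would be learned.

```python
import math

def entity_score(hidden, start, end, weights, bias):
    """Probability that the mention at tokens[start:end] is the original.

    hidden  : list of per-token vectors (lists of floats) from the encoder
    start   : index of the first mention token; end is exclusive
    weights : list of floats of length 2 * dim
    bias    : float
    """
    # The boundary words are the token just before and just after the
    # mention, so the mention needs context on both sides.
    if start < 1 or end >= len(hidden) or start >= end:
        raise ValueError("mention needs a token before and after it")
    left, right = hidden[start - 1], hidden[end]
    features = left + right  # list concatenation -> length 2 * dim
    if len(weights) != len(features):
        raise ValueError("weights must have length 2 * dim")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid (an assumption)

# Usage with toy 2-dimensional "hidden states" for three tokens.
hidden = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
p = entity_score(hidden, start=1, end=2, weights=[1.0, 0.0, 0.0, 1.0], bias=0.0)
```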

Novel Architectural Elements

Entity-level binary classification head utilizing boundary tokens (words before/after entity) to judge factual correctness

Modeling

Base Model: BERT-base (12 layers, 768 hidden)

Training Method: Multi-task learning: Masked Language Modeling (MLM) + Entity Replacement discrimination

Objective Functions:

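Purpose: Distinguish correct entities from replacements.

Formally: J = Ind(e in E+) * log P(e|C) + (1 - Ind(e in E+)) * log (1 - P(e|C)), where Ind(e in E+) is 1 if the mention e is an original entity and 0 if it is a replacement, and C is the surrounding context.

Purpose: Standard language modeling.

Formally: Masked Language Model loss (masks restricted to outside entity spans)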
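The entity objective J can be evaluated numerically as in the sketch below. As written, J is a log-likelihood (it is never positive), so a loss to minimise would be its negative. The function names, the clipping constant, and averaging over mentions are assumptions made for this illustration.

```python
import math

def entity_objective(probs, is_original, eps=1e-12):
    """Sum of J over mentions.

    probs[i]       = P(e_i | C_i), the predicted probability of "original"
    is_original[i] = Ind(e_i in E+), True for original, False for replaced
    """
    total = 0.0
    for p, y in zip(probs, is_original):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += math.log(p) if y else math.log(1.0 - p)
    return total

def entity_loss(probs, is_original):
    """Negative mean of J, suitable for minimisation (an assumption)."""
    return -entity_objective(probs, is_original) / len(probs)

# Usage: confident, correct predictions give a lower loss than wrong ones.
good = entity_loss([0.9, 0.1], [True, False])
bad = entity_loss([0.1, 0.9], [True, False])
```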

Training Data:

English Wikipedia dump
Entities linked via anchors and Wikidata aliases
Contexts split into 512-token chunks

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 128
dropout: 0.05 (final layer)
masking_ratio: 0.05 (for MLM, lower than standard 0.15)
training_steps: 1 million updates
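The masking_ratio listed above, together with the restriction of MLM masks to tokens outside entity spans, can be illustrated with the sketch below. It is a simplification under stated assumptions: every selected token becomes `[MASK]` (standard BERT also keeps or randomises some selected tokens), and the function name and the rounding of the mask count are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_outside_entities(tokens, entity_spans, ratio=0.05, rng=None):
    """Mask about `ratio` of the tokens, never touching entity spans.

    entity_spans : list of (start, end) with end exclusive
    returns      : (masked_tokens, targets) where targets maps each
                   masked position to its original token
    """
    rng = rng or random.Random()
    protected = set()
    for start, end in entity_spans:
        protected.update(range(start, end))
    candidates = [i for i in range(len(tokens)) if i not in protected]
    # Mask at least one token when possible, at most all candidates.
    n_mask = min(len(candidates), max(1, round(ratio * len(tokens)))) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in chosen}
    return masked, targets

# Usage: 100 toy tokens with an entity occupying positions 10..19.
toy = ["w%d" % i for i in range(100)]
masked, targets = mask_outside_entities(toy, [(10, 20)], rng=random.Random(0))
```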

Compute: 32 V100 GPUs for 3 days

Comparison to Prior Work

vs. ERNIE: WKLM learns knowledge directly from text via replacement detection rather than injecting separate KB embeddings.
vs. BERT: WKLM uses entity-centric negative sampling and explicit factuality objectives.
vs. GPT-2: WKLM is bidirectional and uses discriminative entity training.

Limitations

Relies on entity linking heuristics (anchor links/string matching) which may introduce noise.
Requires known entity types from Wikidata for negative sampling constraint.
The BERT-large baseline outperforms WKLM (which is based on BERT-base) on some relations, suggesting that scale still matters.
Masking ratio reduced to 5% to accommodate entity objective, potentially affecting pure language modeling capability.

Reproducibility

Code availability is not explicitly provided in the paper text. Pretraining uses Wikipedia dump. Implementation based on Fairseq. Entity linking relies on Wikidata aliases.

📊 Experiments & Results

Evaluation Setup

Zero-shot fact completion and fine-tuning on downstream entity-heavy tasks.

Benchmarks:

Wikidata Fact Completion (Zero-shot Cloze Ranking) [New]
WebQuestions (Open-domain QA)
TriviaQA (Open-domain QA)
SearchQA (Open-domain QA)
Quasar-T (Open-domain QA)
FIGER (Fine-grained Entity Typing)

Metrics:

Hits@10
Exact Match (EM)
F1 score
Accuracy
Statistical methodology: Not explicitly reported in the paper
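Of the metrics listed above, Hits@10 is the least standard, so a small sketch may help. It assumes each query has a single gold answer and a ranked list of predictions; the function name `hits_at_k` and the toy data are illustrative only.

```python
def hits_at_k(ranked_predictions, gold_answers, k=10):
    """Percentage of queries whose gold answer is in the top-k predictions.

    ranked_predictions : list of lists, best prediction first
    gold_answers       : list of gold answers, one per query
    """
    if not gold_answers:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return 100.0 * hits / len(gold_answers)

# Usage: the first query is a hit, the second is a miss.
preds = [["Paris", "Lyon"], ["Rome", "Milan"]]
gold = ["Paris", "Berlin"]
print(hits_at_k(preds, gold))  # prints 50.0
```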

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Wikidata Relations (Average)	Hits@10	16.1	28.9	+12.8
WebQuestions	F1	35.5	37.9	+2.4
TriviaQA	F1	53.2	56.7	+3.5
FIGER	Accuracy	52.04	60.21	+8.17
SQuAD	F1	87.6	91.3	+3.7

Main Takeaways

WKLM effectively captures entity knowledge, evidenced by strong zero-shot fact completion results.
The method generalizes well to downstream QA tasks involving entities, consistently outperforming standard BERT.
Entity Replacement Training serves as a complementary objective to MLM: removing MLM hurts performance, while adding the entity replacement objective on top of MLM boosts it.
Performance gains are larger on datasets with a heavy entity focus (e.g., TriviaQA) than on datasets with informal queries (e.g., SearchQA).

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (BERT)
Masked Language Modeling (MLM)
Entity Linking concepts
Knowledge Graph basics (Triples, Relations)

Key Terms

WKLM: Weakly Supervised Knowledge-Pretrained Language Model—the proposed model that learns to distinguish true entities from same-type replacements.

MLM: Masked Language Model—a pretraining objective where random tokens are hidden and the model must predict them.

Hits@10: A metric measuring the percentage of times the correct answer appears in the top 10 predictions.

Entity Replacement: The core pretraining strategy where an entity mention is swapped with a random entity of the same type to create a negative training example.

Zero-shot fact completion: A task where the model must predict missing entities in factual statements (converted from knowledge base triples) without specific training on those facts.

FIGER: A dataset for fine-grained entity typing, requiring models to assign specific types to entity mentions.

SQuAD: Stanford Question Answering Dataset—a reading comprehension benchmark.

Wikidata: A structured knowledge base used here to determine entity types and validate relations.

Entity Linking: The process of identifying entity mentions in text and mapping them to unique identifiers in a knowledge base.