TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models

📝 Paper Summary

Knowledge-Enhanced Pre-trained Language Models (KEPLMs)
Efficient Pre-training

TRELM enhances language model pre-training by selectively injecting knowledge only into semantically important entities and updating only specific Feed-Forward Network neurons identified via dynamic knowledge routing.

Core Problem

Existing Knowledge-Enhanced PLMs indiscriminately inject knowledge into all entities (introducing noise/redundancy) and update all model parameters during pre-training (incurring high computational costs).

Why it matters:

Indiscriminate injection introduces irrelevant or redundant knowledge, and the resulting noise degrades performance on downstream tasks
Updating all parameters for knowledge integration is computationally expensive and inefficient
Long-tail entities in text corpora are often poorly optimized, hindering effective knowledge acquisition

Concrete Example: If a sentence mentions a very frequent entity that is unrelated to the sentence's core fact, standard KEPLMs still inject that entity's knowledge triples, adding noise. They also update the entire Transformer to learn this, wasting compute.

Key Novelty

Robust and Efficient Knowledge Injection with Dynamic Routing

Identifies 'important entities' using a semantic importance score to filter out noisy or redundant knowledge injection targets
Maintains a 'Knowledge-augmented Memory Bank' (KMB) that acts as a cheat sheet, storing global and local entity representations to support long-tail entities
Uses 'Dynamic Knowledge Routing' to identify specific neurons in FFN layers responsible for factual knowledge and selectively updates only those parameters during pre-training

Architecture

The overall framework of TRELM, illustrating the interaction between input text, the Knowledge-augmented Memory Bank (KMB), and the Transformer encoder with Dynamic Knowledge Routing.

Evaluation Highlights

Reduces pre-training time by over 50% compared to standard KEPLM approaches while maintaining or improving performance
Outperforms strong baselines (like DKPLM and ERNIE) on the LAMA knowledge probing benchmark
Achieves superior performance on relation extraction and entity typing tasks compared to previous state-of-the-art KEPLMs

Breakthrough Assessment

7/10

Significant efficiency gains (over 50% less pre-training time) combined with robustness improvements make this a practical advancement for KEPLMs, though the core architecture remains Transformer-based.

⚙️ Technical Details

Problem Definition

Setting: Pre-training a language model using large-scale text corpora augmented with an external Knowledge Graph

Inputs: Input token sequence x and a Knowledge Graph G=(E, R)

Outputs: Contextual representations and predicted tokens (via MLM) or entity correctness (via CKA)

Pipeline Flow

Important Entity Detection (SI Score)
Knowledge Retrieval & KMB Update
Input Embedding Construction (Text + Knowledge/KMB)
Transformer Encoding with Dynamic Routing
Loss Calculation (MLM + CKA)

System Modules

Entity Filter

Selects entities for injection based on Semantic Importance (SI) scores

Model or implementation: Based on representation similarity
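
One way to read the SI score is: encode the sentence, encode it again with the candidate entity replaced, and treat a larger change in representation as higher importance. The sketch below assumes a toy encoder (a sum of hand-written 2-d token vectors), cosine similarity, a `[MASK]` placeholder, and a fixed threshold. All of these names and values are illustrative assumptions, not details taken from the paper.

```python
import math

# Toy 2-d token vectors; purely illustrative, not learned embeddings.
TOY_VECTORS = {
    "einstein": [0.9, 0.1],
    "developed": [0.2, 0.2],
    "relativity": [0.8, 0.3],
    "in": [0.05, 0.05],
    "europe": [0.1, 0.1],
}

def encode(tokens):
    """Stand-in sentence encoder (assumption): sum of toy token vectors."""
    rep = [0.0, 0.0]
    for tok in tokens:
        # Unknown tokens, including the placeholder, map to the zero vector.
        vec = TOY_VECTORS.get(tok, [0.0, 0.0])
        rep = [r + v for r, v in zip(rep, vec)]
    return rep

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def si_score(tokens, entity, placeholder="[MASK]"):
    """Hypothetical SI score: 1 - similarity(original, entity-replaced)."""
    replaced = [placeholder if tok == entity else tok for tok in tokens]
    return 1.0 - cosine(encode(tokens), encode(replaced))

def important_entities(tokens, entities, threshold):
    """Keep only entities whose SI score reaches the (assumed) threshold."""
    return [e for e in entities if si_score(tokens, e) >= threshold]

sentence = ["einstein", "developed", "relativity", "in", "europe"]
selected = important_entities(sentence, ["einstein", "europe"], threshold=0.005)
```

In this toy setup, replacing "einstein" shifts the sentence vector more than replacing "europe", so only the former passes the filter. A real system would use the language model's own sentence representation instead of the toy encoder.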

Knowledge-augmented Memory Bank (KMB)

Stores and retrieves local and global memory representations for entities

Model or implementation: Key-Value Store with moving average updates
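
A key-value store with moving-average updates can be sketched as follows. This minimal version keeps one global vector per entity and blends in each new contextual representation with the discount factor gamma. The class name, the exact update rule (an exponential moving average), and the zero-vector fallback for unseen entities are assumptions made for illustration; the local memory mentioned above is not modelled here.

```python
class MemoryBank:
    """Hypothetical entity memory: entity id -> running-average vector."""

    def __init__(self, gamma=0.9):
        if not 0.0 < gamma < 1.0:
            raise ValueError("gamma must lie in (0, 1)")
        self.gamma = gamma
        self.global_mem = {}

    def update(self, entity, vec):
        """Blend a new representation into the stored one (assumed EMA rule)."""
        old = self.global_mem.get(entity)
        if old is None:
            # First sighting of the entity: store the vector as-is.
            self.global_mem[entity] = list(vec)
        else:
            g = self.gamma
            self.global_mem[entity] = [g * o + (1.0 - g) * v for o, v in zip(old, vec)]

    def lookup(self, entity, dim):
        """Return the stored vector, or zeros for an unseen entity."""
        return self.global_mem.get(entity, [0.0] * dim)

bank = MemoryBank(gamma=0.5)
bank.update("Q937", [1.0, 0.0])   # "Q937" is just an example key
bank.update("Q937", [0.0, 1.0])
```

Because the stored vector aggregates every occurrence of an entity, a rarely seen entity still carries information from its earlier contexts, which is the intuition behind using such a bank for long-tail entities.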

Dynamic Knowledge Router

Identifies knowledge paths in FFNs using attribution scores to determine which parameters to update

Model or implementation: Integrated Gradients (Riemann approximation)
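
The Riemann approximation of Integrated Gradients averages the gradient at m points on the straight line from a baseline to the actual input, then scales by the input difference. The sketch below applies it to a toy scalar function over three "neuron activations" with an analytic gradient. The function, its weights, the zero baseline, and the helper names are assumptions for illustration; in the paper's setting the attribution would be computed on FFN neurons of the Transformer.

```python
def integrated_gradients(grad_fn, x, baseline=None, m=20):
    """Riemann (right-sum) approximation of Integrated Gradients with m steps."""
    if baseline is None:
        baseline = [0.0] * len(x)
    total = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        # Point on the straight path from the baseline to x.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        total = [t + g for t, g in zip(total, grads)]
    # Average gradient along the path, scaled by the input difference.
    return [(xi - b) * t / m for xi, b, t in zip(x, baseline, total)]

def top_neurons(attributions, k):
    """Indices of the k highest-attribution neurons."""
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return order[:k]

# Toy model (assumption): F(a) = sum_i w_i * a_i^2, so dF/da_i = 2 * w_i * a_i.
WEIGHTS = [0.5, 2.0, 0.1]

def toy_grad(a):
    return [2.0 * w * ai for w, ai in zip(WEIGHTS, a)]

activations = [1.0, 1.0, 1.0]
attributions = integrated_gradients(toy_grad, activations, m=20)
selected = top_neurons(attributions, k=1)
```

For this quadratic toy function the exact attribution of neuron i is w_i * a_i^2, and the m-step right sum overestimates it by a factor of (m + 1) / m, so a larger m trades extra gradient evaluations for a tighter approximation.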

Novel Architectural Elements

Knowledge-augmented Memory Bank (KMB) integrated into the input layer to support long-tail entities
Dynamic Knowledge Routing mechanism that selectively unfreezes specific FFN neurons based on attribution scores during the backward pass
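
Selective updating can be pictured as a masked gradient step: only the FFN neurons selected by attribution receive an update, and all others keep their weights. The sketch below uses plain SGD on a small weight matrix whose rows stand for neurons. The function name, the row-per-neuron layout, and the learning rate are illustrative assumptions. It shows which parameters change, not how the training-time savings are obtained in practice.

```python
def masked_sgd_step(weights, grads, selected, lr=0.1):
    """Apply an SGD step only to the rows (neurons) listed in `selected`."""
    selected = set(selected)
    updated = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        if i in selected:
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
        else:
            # Frozen neuron: weights are copied through unchanged.
            updated.append(list(w_row))
    return updated

ffn_weights = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
ffn_grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
new_weights = masked_sgd_step(ffn_weights, ffn_grads, selected=[1], lr=0.5)
```

In a deep-learning framework the same effect is usually achieved by disabling gradients for frozen parameters or by multiplying gradients with a 0/1 mask before the optimizer step.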

Modeling

Base Model: RoBERTa-base / RoBERTa-large

Training Method: Knowledge-Enhanced Pre-training with Selective Updates

Objective Functions:

Purpose: Standard masked token prediction.

Formally: Masked Language Modeling (MLM) loss

Purpose: Verify if the model captures knowledge by distinguishing correct tail entities.

Formally: Contrastive Knowledge Assessing (CKA) loss = -log( exp(f(h, t)) / sum_{t'} exp(f(h, t')) )
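
The CKA loss has the shape of a softmax cross-entropy over candidate tail entities. The sketch below computes it from raw scores f(h, t). It assumes that the denominator sums over the correct tail together with the sampled negatives, which the formula above leaves implicit, and it uses the log-sum-exp trick for numerical stability. The function name and the score values are hypothetical.

```python
import math

def cka_loss(pos_score, neg_scores):
    """-log( exp(pos) / sum over candidates of exp(score) ).

    Assumption: candidates = the correct tail plus all negatives.
    """
    scores = [pos_score] + list(neg_scores)
    peak = max(scores)
    # log-sum-exp, shifted by the maximum score to avoid overflow.
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_denominator - pos_score

# Hypothetical scores for one (head, relation) query.
weak = cka_loss(1.0, [0.0, 0.0])    # correct tail barely ahead of negatives
strong = cka_loss(5.0, [0.0, 0.0])  # correct tail clearly ahead
```

The loss shrinks as the correct tail's score pulls away from the negatives, which is what lets it act as a check on whether the model has captured the fact.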

Training Data:

Wikipedia (text corpus)
Wikidata5M (Knowledge Graph)

Key Hyperparameters:

lambda: 0.5 (initial KMB mixing weight, decays to 0)
gamma: Discount factor for KMB updates (value in (0,1))
m: 20 (Riemann approximation steps for attribution)
beta: Decay rate controller for lambda
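
The summary states only that lambda starts at 0.5, decays to 0, and that beta controls the decay rate. As a rough illustration, the sketch below assumes an exponential schedule and a convex mix of the token embedding with the KMB vector. Both the schedule form and the mixing rule are assumptions, as are the function names and the beta value; the paper's exact formulas may differ.

```python
import math

def kmb_weight(step, lam0=0.5, beta=0.01):
    """Hypothetical schedule: lambda_t = lam0 * exp(-beta * t)."""
    return lam0 * math.exp(-beta * step)

def mix(token_vec, memory_vec, lam):
    """Hypothetical input mix: (1 - lam) * token embedding + lam * KMB vector."""
    return [(1.0 - lam) * t + lam * m for t, m in zip(token_vec, memory_vec)]

early = mix([1.0, 2.0], [3.0, 4.0], kmb_weight(0))      # memory contributes half
late = mix([1.0, 2.0], [3.0, 4.0], kmb_weight(10_000))  # memory contributes almost nothing
```

Under this reading, the memory bank supports entity representations early in pre-training and is phased out as the model's own embeddings mature.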

Compute: Pre-training time reduced by >50% compared to full parameter updates. Exact GPU hours are not reported; only the relative reduction is given.

Comparison to Prior Work

vs. ERNIE-THU: TRELM selectively updates parameters via routing instead of training additional large encoders
vs. DKPLM: TRELM uses a Memory Bank (KMB) for long-tail entities and attribution-based routing for efficiency
vs. K-BERT: TRELM filters entities by semantic importance to reduce noise, whereas K-BERT injects available triples
vs. K-Adapter [not cited in paper]: TRELM modifies internal FFNs selectively, whereas K-Adapter freezes the model and adds external adapters

Limitations

Relies on the hypothesis that factual knowledge is primarily stored in FFN layers (debated in broader literature)
Calculation of attribution scores (Integrated Gradients) adds overhead from the extra evaluations at each approximation step, though net training time decreases because fewer parameters are updated
Performance depends on the quality and coverage of the external Knowledge Graph (Wikidata5M)

Reproducibility

Code: https://github.com/alibaba/EasyNLP

Source code is available in the Alibaba EasyNLP framework. The paper uses standard datasets (Wikipedia, Wikidata5M, LAMA) and provides implementation details for attribution (m=20).

📊 Experiments & Results

Evaluation Setup

Pre-training on Wikipedia+Wikidata5M, followed by fine-tuning or probing on downstream tasks.

Benchmarks:

LAMA (Knowledge Probing, cloze-style QA)
Open Entity (Entity Typing)
TACRED (Relation Extraction)

Metrics:

P@1 (Precision at 1) for LAMA
F1 score for Entity Typing
F1 score for Relation Extraction
Training Time / Acceleration Ratio

Statistical methodology: Not explicitly reported in the paper
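
P@1 for cloze-style probing counts a query as correct only when the model's top-ranked prediction equals the gold answer. A minimal sketch, assuming each query yields a ranked list of candidate strings and a single gold answer (the function name and the example data are hypothetical):

```python
def precision_at_1(ranked_predictions, gold_answers):
    """Fraction of queries whose top-ranked prediction matches the gold answer."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("need one ranked list per gold answer")
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] == gold
    )
    return hits / len(gold_answers)

# Two made-up cloze queries: the first is answered correctly at rank 1,
# the second has the gold answer only at rank 2.
predictions = [["Paris", "Lyon"], ["Rome", "Milan"]]
gold = ["Paris", "Milan"]
score = precision_at_1(predictions, gold)
```

Benchmark tables such as the ones for LAMA typically report this fraction multiplied by 100.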

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TRELM outperforms baselines on knowledge probing tasks (LAMA), indicating superior factual knowledge retention.
LAMA (Google-RE)	P@1 (Mean)	12.0	13.2	+1.2
LAMA (T-REx)	P@1 (Mean)	32.4	34.1	+1.7
TRELM demonstrates improvements on downstream knowledge-aware tasks like Relation Extraction and Entity Typing.
TACRED	F1	70.6	71.6	+1.0
Open Entity	F1	77.8	78.3	+0.5
Efficiency experiments show significant reductions in training time.
Pre-training Cost	Relative Training Time (%)	100	50	-50
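
A reduction in training time and a speedup factor are easy to conflate. The small helper below converts one into the other, assuming the reported reduction is the fraction of wall-clock time saved relative to the baseline. The function name is hypothetical.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional time reduction (0 <= r < 1) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    # New time = (1 - r) * old time, so speedup = old / new = 1 / (1 - r).
    return 1.0 / (1.0 - reduction)

factor = speedup_from_time_reduction(0.5)
```

Under that reading, halving the training time corresponds to a 2x speedup, and any reduction above 50% corresponds to more than 2x.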

Main Takeaways

Dynamic Knowledge Routing significantly accelerates pre-training (over 50% reduction) by updating only a subset of parameters, without sacrificing performance.
Filtering entities by Semantic Importance (SI) effectively reduces knowledge noise, preventing degradation often seen with indiscriminate injection.
The Knowledge-augmented Memory Bank (KMB) helps the model learn better representations for long-tail entities that appear infrequently in the corpus.
TRELM consistently outperforms strong KEPLM baselines (DKPLM, CoLAKE) across knowledge-intensive tasks, validating the robustness of the approach.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (specifically Feed-Forward Networks)
Masked Language Modeling (MLM)
Knowledge Graphs (entities and relations)
Integrated Gradients (for attribution)

Key Terms

KEPLM: Knowledge-Enhanced Pre-trained Language Model—a PLM that incorporates external structured knowledge (usually from Knowledge Graphs) to improve understanding

FFN: Feed-Forward Network—the dense layers within a Transformer block, hypothesized here to store factual knowledge

Knowledge Path: A sequence of specific neurons across FFN layers that are highly attributed to the prediction of a correct knowledge fact

KMB: Knowledge-augmented Memory Bank—a storage mechanism that keeps track of entity representations (local and global) to help with long-tail distribution issues

LAMA: LAnguage Model Analysis—a benchmark dataset used to probe the factual knowledge stored in pre-trained language models

Dynamic Knowledge Routing: A method to selectively update only the parameters corresponding to identified 'knowledge neurons' rather than the whole model

SI Score: Semantic Importance Score—measures how much a sentence's representation changes if a specific entity is replaced, used to identify important entities

CKA: Contrastive Knowledge Assessing—a pre-training objective where the model must distinguish the correct tail entity from negatives given a head entity and relation