Adapter: A small neural module added to a pre-trained model to learn new tasks/knowledge while keeping the main model frozen
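A minimal sketch of the adapter idea in numpy (toy sizes and initialization are my assumptions, not from the source): the pre-trained weights stay frozen, and only a small bottleneck module, wrapped in a residual connection, would be trained for the new task.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 8, 2  # toy dimensions, chosen for illustration

# Frozen pre-trained layer: its weights are never updated.
W_frozen = rng.normal(size=(hidden, hidden))

# Adapter: down-project, nonlinearity, up-project. These few parameters
# are the only ones that would be trained on the new task.
W_down = rng.normal(size=(hidden, bottleneck)) * 0.01
W_up = np.zeros((bottleneck, hidden))  # zero init -> adapter starts as identity

def adapter(h):
    # Residual form: at initialization the adapter passes h through unchanged,
    # so adding it does not disturb the pre-trained model's behavior.
    return h + np.maximum(h @ W_down, 0.0) @ W_up

def layer_with_adapter(x):
    h = x @ W_frozen     # frozen pre-trained computation
    return adapter(h)    # small trainable module on top

x = rng.normal(size=(1, hidden))
out = layer_with_adapter(x)
print(out.shape)  # (1, 8)
```

Because only the adapter's parameters change during fine-tuning, the frozen model cannot catastrophically forget what it already knows.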
Catastrophic forgetting: The tendency of neural networks to lose previously acquired knowledge when trained on new information
RoBERTa: Robustly Optimized BERT Pretraining Approach; a transformer-based masked language model
T-REx: A large-scale alignment dataset between Wikipedia abstracts and Wikidata triples, used here for factual knowledge
Dependency parsing: Analyzing the grammatical structure of a sentence to establish relationships between 'head' words and words which modify those heads
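A toy encoding of a dependency parse (not a parser), assuming the common head-index scheme: each token records the index of the word it modifies, and the modifier/head arcs can be read off directly. The sentence and indices are illustrative.

```python
sentence = ["She", "reads", "books"]
heads = [1, -1, 1]  # "She" and "books" both modify the head "reads"; -1 marks the root

def arcs(sentence, heads):
    # Yield (modifier, head) pairs established by the parse, skipping the root.
    return [(sentence[i], sentence[h]) for i, h in enumerate(heads) if h >= 0]

print(arcs(sentence, heads))  # [('She', 'reads'), ('books', 'reads')]
```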
LAMA: LAnguage Model Analysis—a probe to test factual knowledge in language models using cloze-style questions
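A hedged sketch of how a cloze-style LAMA query is built: a relation template with `[X]`/`[Y]` placeholders is filled with the subject, and the object slot becomes the `[MASK]` token the model must predict. The template and fact here are illustrative, not taken from the LAMA data.

```python
def to_cloze(subject, template):
    # Fill the subject slot and turn the object slot into a masked position.
    return template.replace("[X]", subject).replace("[Y]", "[MASK]")

query = to_cloze("France", "The capital of [X] is [Y].")
print(query)  # The capital of France is [MASK].
```

A model that knows the fact should rank the correct answer ("Paris") highly for the masked position.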
Disentangled representation: Representations where different types of information (e.g., syntax vs. facts) are separated rather than mixed together
Skip-connection: A direct connection between non-adjacent layers in a neural network that allows information to bypass intermediate layers
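The skip-connection above can be sketched in a few lines (the transformation inside the block is an arbitrary placeholder): the input bypasses the intermediate computation and is added to its output.

```python
import numpy as np

def block(x, W):
    # Some intermediate transformation the skip-connection routes around.
    return np.tanh(x @ W)

def block_with_skip(x, W):
    # Skip-connection: the input is added to the block's output, so
    # information (and gradients) can flow past the intermediate layer.
    return x + block(x, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
W = np.zeros((4, 4))  # with a zero transform the block contributes nothing...
print(np.allclose(block_with_skip(x, W), x))  # True: the skip passes x through
```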