SenseBERT: Driving Some Sense into BERT

📝 Paper Summary

Language Model Pre-training, Word Sense Disambiguation

SenseBERT augments BERT pre-training by predicting both masked words and their WordNet supersenses, enabling the model to learn lexical semantics directly from unlabeled text.

Core Problem

Standard self-supervised models like BERT operate at the word-form level, which acts as an ambiguous surrogate for underlying meaning (senses), leading to poor performance on tasks requiring explicit semantic categorization.

Why it matters:

Word forms are ambiguous (e.g., 'bass' can mean fish or guitar), causing standard models to struggle with distinguishing meanings in context
Existing sense-aware approaches rely on static embeddings or small annotated datasets, failing to leverage the scale of unannotated corpora used by BERT
Vanilla BERT often fails to grasp lexical semantics, exhibiting high misclassification rates on supersense tasks despite its strong general performance

Concrete Example: In the sentence 'Dan cooked a bass on the grill', a standard model sees only the word 'bass'. It might predict 'salmon' based on co-occurrence but miss the semantic category. SenseBERT is explicitly trained to recognize 'bass' here as 'noun.food' rather than 'noun.artifact' (musical instrument), improving disambiguation.

Key Novelty

Weakly-Supervised Supersense Pre-training

Adds a 'supersense prediction' auxiliary task during pre-training: the model must predict the WordNet semantic category (supersense) of a masked word alongside the word itself
Uses WordNet as a weak supervisor to generate allowed sense labels for unannotated text, enabling semantic learning at scale without human-annotated datasets
Introduces a soft-labeling scheme where the model predicts any valid supersense for a word, allowing context to naturally reinforce the correct meaning over time

Architecture

Comparison of BERT and SenseBERT pre-training architectures. SenseBERT adds a parallel output head for supersense prediction and injects supersense embeddings into the input.

Evaluation Highlights

+10.5 points accuracy improvement on SemEval-SS (Supersense Disambiguation) over BERT Base in the 'Frozen' setting, showing superior intrinsic semantic knowledge
Achieves state-of-the-art score of 72.14 on the Word in Context (WiC) task with SenseBERT Large, surpassing BERT Large by 2.5 points
SenseBERT Large outperforms BERT Large on SemEval-SS without fine-tuning (Frozen setting): 79.5 vs 67.3 accuracy

Breakthrough Assessment

7/10

Significant improvement on semantic tasks (WiC, WSD) by integrating external knowledge (WordNet) into pre-training. While a strong conceptual advance, it relies on legacy WordNet resources.

⚙️ Technical Details

Problem Definition

Setting: Pre-training a Masked Language Model (MLM) with an additional auxiliary task for semantic categorization

Inputs: Sequence of words with masked tokens

Outputs: Predicted probability distributions over the vocabulary (words) and WordNet supersenses

Pipeline Flow

Input Processing (Masking & WordNet Lookup)
Transformer Encoder
Dual Output Heads (Word Prediction & Supersense Prediction)

System Modules

Masking and WordNet Lookup (Input Processing)

Masks words and identifies 'allowed supersenses' for the masked word using WordNet

Model or implementation: Rule-based lookup
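The lookup step can be sketched as follows. This is a minimal illustration, not the paper's code: `TOY_LEXICON`, `allowed_supersenses`, and `label_masked_positions` are hypothetical names, and the lexicon entries are hand-written stand-ins for WordNet. With NLTK's WordNet interface, the same sets could likely be collected from the `lexname()` of each synset returned for a word.

```python
# Toy stand-in for WordNet: word form -> supersenses of its synsets.
# The entries are illustrative assumptions, not an excerpt of the real database.
TOY_LEXICON = {
    "bass": {"noun.food", "noun.animal", "noun.artifact", "noun.attribute"},
    "grill": {"noun.artifact", "verb.change"},
    "cooked": {"verb.change", "verb.creation"},
}


def allowed_supersenses(word, lexicon=TOY_LEXICON):
    """Return A(w): every supersense the lexicon lists for `word` (empty if unknown)."""
    return set(lexicon.get(word.lower(), set()))


def label_masked_positions(tokens, masked_positions, lexicon=TOY_LEXICON):
    """Attach the allowed-supersense set to each masked token position."""
    return {i: allowed_supersenses(tokens[i], lexicon) for i in masked_positions}


tokens = "Dan cooked a bass on the grill".split()
labels = label_masked_positions(tokens, masked_positions=[3])
print(labels[3])  # the set A("bass") from the toy lexicon
```

Words missing from the lexicon get an empty set, which mirrors the coverage limitation: no supersense supervision is available for them.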

Sense-Aware Input Embedding (Input Processing)

Constructs input vectors by combining word embeddings and supersense embeddings

Model or implementation: Linear combination

Transformer Encoder

Computes contextualized representations

Model or implementation: BERT (Base or Large)

Supersense Prediction Head

Predicts the supersense of the masked word

Model or implementation: Linear projection S
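A rough sketch of the two output heads, assuming each head is a plain linear projection of the encoder's hidden vector followed by a softmax. The matrix orientation (one row per word or supersense), the toy sizes, and the helper names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(matrix, hidden):
    """Multiply a (rows x d) matrix by a length-d hidden vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in matrix]


def dual_head_predict(hidden, W, S):
    """Return (word distribution, supersense distribution) for one masked position."""
    return softmax(project(W, hidden)), softmax(project(S, hidden))


# Toy sizes: 3-word vocabulary, 2 supersenses, hidden size 2.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
S = [[2.0, 0.0], [0.0, 2.0]]
p_words, p_senses = dual_head_predict([1.0, 0.0], W, S)
```

Both heads read the same contextual vector, so the supersense objective shapes the representation that the word head also uses.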

Novel Architectural Elements

Dual-head pre-training objective: Parallel projection matrices W (words) and S (supersenses) trained jointly
Sense-aware input embeddings: Input vectors are constructed using both word matrix W and supersense matrix S (via mapping M), enabling semantic information flow even for rare words
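One way to picture the sense-aware input embedding, under the assumption that M is a fixed 0/1 map from each word to its allowed supersenses: the input vector is the word's own vector plus the sum of the vectors of those supersenses. Positional embeddings are omitted, and all names and numbers below are hypothetical.

```python
def sense_aware_embedding(word, word_vecs, sense_vecs, allowed):
    """Word vector plus the vectors of every supersense allowed for the word."""
    out = list(word_vecs[word])
    for sense in allowed.get(word, ()):
        out = [o + s for o, s in zip(out, sense_vecs[sense])]
    return out


# Toy 2-dimensional vectors (illustrative values only).
word_vecs = {"bass": [0.1, 0.2], "salmon": [0.3, 0.0]}
sense_vecs = {"noun.food": [1.0, 0.0], "noun.artifact": [0.0, 1.0]}
allowed = {"bass": ["noun.food", "noun.artifact"], "salmon": ["noun.food"]}

bass_vec = sense_aware_embedding("bass", word_vecs, sense_vecs, allowed)
salmon_vec = sense_aware_embedding("salmon", word_vecs, sense_vecs, allowed)
```

Under this reading, a rare word with a poorly trained word vector still receives a meaningful input signal from the supersense vectors it shares with frequent words.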

Modeling

Base Model: BERT-Base and BERT-Large architectures

Training Method: Multi-task pre-training (Masked LM + Masked Supersense Prediction)

Objective Functions:

Purpose: Standard Masked Language Modeling.

Formally: L_LM = -log p(w | context)

Purpose: Maximize probability that predicted sense is within the set of allowed WordNet supersenses for the word.

Formally: L_allowed_SLM = -log sum_{s in A(w)} p(s | context)

Purpose: Regularization to prevent collapse to a single sense subset, encouraging uniform prediction over all allowed senses.

Formally: L_reg_SLM = - sum_{s in A(w)} (1/|A(w)|) log p(s | context)
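The two supersense terms can be computed directly from a predicted distribution. The sketch below follows the formulas as written; the function name `slm_losses` and the example probabilities are illustrative assumptions, and how the terms are weighted against L_LM is not specified in this summary.

```python
import math


def slm_losses(sense_probs, allowed):
    """Compute (L_allowed_SLM, L_reg_SLM) for one masked word.

    sense_probs: dict mapping supersense -> p(s | context)
    allowed:     the set A(w) of supersenses WordNet allows for the word
    """
    # -log of the total probability mass placed on allowed supersenses.
    loss_allowed = -math.log(sum(sense_probs[s] for s in allowed))
    # Average negative log-probability over the allowed supersenses.
    loss_reg = -sum(math.log(sense_probs[s]) for s in allowed) / len(allowed)
    return loss_allowed, loss_reg


probs = {"noun.food": 0.6, "noun.artifact": 0.2, "verb.motion": 0.2}
la, lr = slm_losses(probs, {"noun.food", "noun.artifact"})

# If all mass sits uniformly on the allowed set, the first term vanishes.
la_uniform, lr_uniform = slm_losses({"a": 0.5, "b": 0.5}, {"a", "b"})
```

The allowed-set term is zero whenever no probability leaks outside A(w), regardless of how mass is split inside it. The regularizer is what penalizes putting almost all mass on one allowed sense, and it is smallest when the allowed senses share the mass evenly.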

Training Data:

Same data as BERT (BooksCorpus + English Wikipedia)
WordNet used for generating supersense labels A(w)

Key Hyperparameters:

vocabulary_size: Tried 30k (standard) and 60k (augmented)
masking_strategy: Prioritizes single-supersense words (50% of masked words)
learning_rate: Same as Devlin et al. (2019)
max_seq_length: 128

Compute: Not reported in the paper

Comparison to Prior Work

vs. BERT: Adds explicit supersense prediction task and sense-aware input embeddings
vs. KnowBERT: Pre-trains with supersense objective directly rather than using an attention mechanism over external knowledge bases [KnowBERT-W+W cited as baseline]
vs. Loureiro and Jorge (2019): SenseBERT learns senses during pre-training rather than retrofitting/constructing them from static embeddings post-hoc

Limitations

Relies on WordNet coverage; words not in WordNet (e.g., slang, new entities) lack supersense supervision
Restricted to coarse-grained supersenses (45 categories) rather than fine-grained synsets
Soft-labeling scheme introduces noise when words have multiple valid supersenses in a specific context (though large corpus mitigates this)

Reproducibility

No code URL provided. WordNet is publicly available. The method requires modifying the BERT pre-training loop to include the auxiliary task and the WordNet lookup mechanism.

📊 Experiments & Results

Evaluation Setup

Pre-training followed by fine-tuning or frozen-embedding evaluation on lexical semantic tasks

Benchmarks:

SemEval-SS: supersense disambiguation, converted from WSD [New]
WiC (Word in Context): binary classification (same/different sense)
GLUE (General NLU benchmark)

Metrics:

Accuracy (SemEval-SS, WiC, GLUE tasks)
F1 (GLUE tasks)
Statistical methodology: Not explicitly reported in the paper

Key Results

SemEval-SS results demonstrate the model's ability to disambiguate word senses. The 'Frozen' setting tests intrinsic knowledge without task-specific training.
Benchmark	Metric	Baseline	This Paper	Δ
SemEval-SS	Accuracy	65.1	75.6	+10.5
SemEval-SS	Accuracy	67.3	79.5	+12.2
SemEval-SS	Accuracy	81.1	83.7	+2.6
Word in Context (WiC)	Accuracy	69.6	72.1	+2.5
Word in Context (WiC)	Accuracy	70.9	72.1	+1.2
GLUE (average)	Score	77.5	77.9	+0.4

Experiment Figures

UMAP visualization of the learned supersense vectors (rows of matrix S).

Demonstration of supersense probabilities assigned to masked words in context.

Main Takeaways

SenseBERT Large reports a state-of-the-art WiC score (72.1), suggesting that weak supervision during pre-training effectively teaches lexical-semantic distinctions
Gains of +10.5 and +12.2 accuracy points over BERT Base and BERT Large in the 'Frozen' SemEval-SS evaluation suggest SenseBERT natively encodes semantic categories in its embeddings, whereas BERT needs fine-tuning to surface them
The '60K vocabulary' strategy for OOV words performs comparably to 'average embedding', but both outperform the baseline of ignoring OOV senses
Visualizations show SenseBERT learns to cluster supersenses semantically (e.g., nouns vs verbs, artifacts vs food) without explicit supervision beyond the weak WordNet signal

📚 Prerequisite Knowledge

Prerequisites

Understanding of BERT and Masked Language Modeling (MLM)
Familiarity with WordNet hierarchy (synsets, lemmas, supersenses)
Basic knowledge of multi-task learning

Key Terms

Supersense: A coarse-grained semantic category from WordNet (e.g., 'noun.animal', 'verb.motion'); there are 45 total categories

WordNet: A large lexical database of English grouping words into sets of synonyms (synsets) and defining relationships between them

WiC: Word in Context, a binary classification task determining if a target word has the same meaning in two different sentences

SemEval-SS: A variant of the SemEval Word Sense Disambiguation task where fine-grained senses are mapped to coarse-grained supersenses

Soft-labeling: A training approach where the target is a distribution over multiple valid labels (allowed supersenses) rather than a single ground truth, used here to handle ambiguity in unannotated text

Weight tying: Using the same matrix for both the input embeddings and the output projection layer to share parameters and improve representation quality