Difference-Masking: Choosing What to Mask in Continued Pretraining

📝 Paper Summary

Self-Supervised Learning (SSL), Continued Pretraining / Domain Adaptation

Difference-Masking improves continued pretraining by selectively masking concepts that make a target domain different from the general pretraining domain, rather than masking randomly.

Core Problem

Standard masked language modeling randomly selects tokens to mask, but this strategy ignores the intuition that some concepts are more critical for learning a specific domain than others.

Why it matters:

Random masking forces models to spend capacity reconstructing trivial or domain-irrelevant words (e.g., 'the', 'process') rather than key domain concepts.
Effective domain adaptation is crucial when target domains (like chemistry or medical texts) differ substantially from general pretraining data.
Existing selective masking approaches often rely on supervised labels or domain-specific entity taggers which are not always available.

Concrete Example: In a chemistry corpus, random masking might hide the word 'process' (common in all domains), which is less informative for learning chemistry. Difference-Masking identifies that 'molecule' is unique to the chemistry domain compared to general web text and prioritizes masking it to force the model to learn chemical concepts.

Key Novelty

Difference-Masking (TF-ICF based masking)

Identifies 'difference anchors': words that appear frequently in the target domain but infrequently in the general pretraining corpus (using a TF-IDF-like metric called TF-ICF).
Generates a masking probability distribution where tokens are more likely to be masked if they are semantically similar to these difference anchors.
Applies this strategy to both text (using token similarity) and video (using object detection labels) without requiring supervised task labels.

Architecture

The Difference-Masking pipeline illustrating the two-step process: finding difference anchors and then masking based on similarity.

Evaluation Highlights

Outperforms random masking and 5 other baselines across 4 datasets (ChemProt, ACL-ARC, TVQA, Social-IQ).
+1.16% accuracy improvement on ChemProt over the strongest baseline (Salient Span Masking) using RoBERTa.
+2.37% accuracy improvement on Social-IQ over Random Masking using MERLOT-Reserve.

Breakthrough Assessment

7/10

Simple, intuitive, and effective method that outperforms complex baselines (like gradient-based or attention-based masking) without needing supervision. Extends well to multimodal settings.

⚙️ Technical Details

Problem Definition

Setting: Continued pretraining of a model, already pretrained on a source domain X_PT, on unlabeled data from a target domain X_T.

Inputs: Unlabeled data from target domain X_T.

Outputs: Adapted model parameters optimized for X_T.

Pipeline Flow

Difference Anchor Identification
Masking Probability Generation
Continued Pretraining (Masking & Predicting)

System Modules

Anchor Finder

Identify top-K words unique to the target domain relative to general pretraining data

Model or implementation: Statistical TF-ICF calculation (non-neural)
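
A minimal sketch of the anchor-finding step. It assumes a ratio-style score: relative frequency in the target corpus divided by smoothed relative frequency in the general corpus. The exact scoring formula, the smoothing constant, and the names `tf_icf_scores` and `top_k_anchors` are illustrative assumptions, not taken from the paper or its repository.

```python
from collections import Counter


def tf_icf_scores(target_tokens, general_tokens, smoothing=1e-9):
    """Score each target-domain word by how much more frequent it is in the
    target corpus than in the general corpus (assumed ratio form).
    Both token lists are expected to be non-empty."""
    target_counts = Counter(target_tokens)
    general_counts = Counter(general_tokens)
    n_target = len(target_tokens)
    n_general = len(general_tokens)
    scores = {}
    for word, count in target_counts.items():
        tf = count / n_target                    # frequency in target domain
        cf = general_counts[word] / n_general    # frequency in general corpus
        scores[word] = tf / (cf + smoothing)     # smoothing avoids division by zero
    return scores


def top_k_anchors(scores, k):
    """Return the k highest-scoring words as difference anchors."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpora (hypothetical): 'process' and 'the' occur in both,
# 'molecule' only in the target domain.
target = "the molecule binds the enzyme in this process".split()
general = "the process of writing the report is the usual process".split()
scores = tf_icf_scores(target, general)
anchors = top_k_anchors(scores, k=2)
```

In this toy setup, words shared with the general corpus ('the', 'process') receive low scores, while words seen only in the target corpus dominate the ranking.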

Mask Selector (Training / Masking)

Calculate probability of masking each token based on semantic similarity to anchors

Model or implementation: Cosine similarity using pretrained embeddings (BERT or similar)
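
The similarity-to-probability step might look like the sketch below. The rule used here (take each token's cosine similarity to its nearest anchor, clip negatives to zero, then normalize) is an assumption made for illustration, as are the function names and the toy 2-d embeddings; the summary only states that tokens closer to anchors are more likely to be masked.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def masking_distribution(token_vecs, anchor_vecs):
    """Turn anchor similarity into a probability distribution over tokens.
    Assumed rule: weight = max(0, similarity to nearest anchor), normalized."""
    weights = [
        max(0.0, max(cosine(t, a) for a in anchor_vecs)) for t in token_vecs
    ]
    total = sum(weights)
    if total == 0.0:
        # No token resembles any anchor: fall back to uniform (random) masking.
        return [1.0 / len(token_vecs)] * len(token_vecs)
    return [w / total for w in weights]


# Hypothetical toy embeddings.
anchor_vecs = [[1.0, 0.0]]      # a single anchor, e.g. 'molecule'
token_vecs = [
    [0.9, 0.1],                 # close to the anchor
    [0.0, 1.0],                 # orthogonal to the anchor
    [0.5, 0.5],                 # in between
]
probs = masking_distribution(token_vecs, anchor_vecs)
```

The token closest to the anchor receives the largest share of the masking probability, and the orthogonal token receives none.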

Pretraining Model (Training / Masking)

Reconstruct masked tokens

Model or implementation: RoBERTa-base (text) or MERLOT-Reserve (multimodal)
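
Given a per-token masking distribution, the masking step can be sketched as weighted sampling without replacement under a fixed budget. This is an illustrative sketch, not the repository's implementation: the helper names are hypothetical, every selected token is replaced with `<mask>` (the common 80/10/10 replacement scheme is omitted), and the example weights are made up.

```python
import random


def sample_mask_positions(probs, mask_ratio=0.15, seed=None):
    """Pick about mask_ratio * len(probs) positions without replacement,
    favoring positions with higher probability."""
    rng = random.Random(seed)
    candidates = [i for i, p in enumerate(probs) if p > 0]
    weights = [probs[i] for i in candidates]
    n_mask = min(max(1, round(mask_ratio * len(probs))), len(candidates))
    chosen = []
    for _ in range(n_mask):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))  # remove so it cannot be re-drawn
        weights.pop(pick)
    return sorted(chosen)


def apply_mask(tokens, positions, mask_token="<mask>"):
    """Return the masked sequence and per-position labels (None = not masked)."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in positions:
        labels[i] = tokens[i]
        masked[i] = mask_token
    return masked, labels


tokens = ("the inhibitor binds the kinase and blocks the phosphorylation of "
          "the substrate in this cell line under these conditions today").split()
domain_words = {"inhibitor", "kinase", "phosphorylation", "substrate"}
raw = [5.0 if t in domain_words else 1.0 for t in tokens]  # hypothetical weights
probs = [r / sum(raw) for r in raw]

positions = sample_mask_positions(probs, mask_ratio=0.15, seed=0)
masked, labels = apply_mask(tokens, positions)
```

The model is then trained to predict `labels` at the masked positions from `masked`.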

Novel Architectural Elements

TF-ICF scoring mechanism integrated into the masking pipeline to determine mask placement based on corpus-level statistics rather than random chance or local attention.

Modeling

Base Model: RoBERTa-base (125M params) for text; MERLOT-Reserve-base (200M params) for multimodal

Training Method: Masked Language Modeling (MLM) continued pretraining

Objective Functions:

Purpose: Reconstruct masked tokens from context.

Formally: Standard cross-entropy loss on masked tokens.
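
Written out, a standard masked-token cross-entropy has the form below. The notation is introduced here for illustration and is not taken from the paper: $M$ is the set of masked positions, $\tilde{x}$ is the input with those positions masked, $x_i$ is the original token at position $i$, and $\theta$ denotes the model parameters.

```latex
\mathcal{L}_{\mathrm{MLM}}(\theta)
  = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)
```

Under Difference-Masking the loss itself is unchanged; only the way $M$ is chosen differs from random masking.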

Adaptation: Full fine-tuning of all model parameters during the continued pretraining phase

Key Hyperparameters:

learning_rate: 1e-4 (RoBERTa), 1e-5 (MERLOT)
batch_size: 256 (RoBERTa text), 64 (MERLOT)
masking_probability: 0.15
warmup_ratio: 0.06
weight_decay: 0.01
anchor_count_K: 100
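
For reference, the values listed above can be gathered into a small configuration object. This is only an organizational sketch: the class and field names are hypothetical, the defaults without a per-model value are assumed to apply to both models, and the total step count in the usage line is a made-up number used to show how a warmup ratio translates into steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    learning_rate: float
    batch_size: int
    masking_probability: float = 0.15
    warmup_ratio: float = 0.06
    weight_decay: float = 0.01
    anchor_count_k: int = 100

    def warmup_steps(self, total_steps: int) -> int:
        """Number of warmup steps implied by warmup_ratio."""
        return round(self.warmup_ratio * total_steps)


ROBERTA = PretrainConfig(learning_rate=1e-4, batch_size=256)
MERLOT = PretrainConfig(learning_rate=1e-5, batch_size=64)

# Hypothetical run length, for illustration only.
example_warmup = ROBERTA.warmup_steps(total_steps=10_000)
```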

Compute: Not reported in the paper

Comparison to Prior Work

vs. Random Masking: Difference-Masking uses corpus-level domain statistics to decide which tokens to mask instead of selecting them at random.
vs. SSM/EntityBERT: Difference-Masking does not require an external pretrained NER model.
vs. Selective Masking: Difference-Masking is fully self-supervised and does not require task labels.
vs. AttnMask/InfoMask: Difference-Masking incorporates global domain statistics (corpus-level) rather than relying solely on local instance attention.

Limitations

Requires a general domain corpus (like Google Web Trillion Word Corpus) to calculate term frequencies for comparison.
Requires computing cosine similarity for every token against anchors during masking, which adds computational overhead compared to random masking.
Performance gain depends on the domain actually being distinct from the pretraining data.

Reproducibility

Code: https://github.com/SherylM/DifferenceMasking

The code is publicly available at the repository linked above. Experiments use standard datasets (ChemProt, ACL-ARC, TVQA, Social-IQ) and standard pretrained models (RoBERTa, MERLOT-Reserve).

📊 Experiments & Results

Evaluation Setup

Continued pretraining on in-domain unlabeled data followed by finetuning on labeled downstream tasks.

Benchmarks:

ChemProt (Relation Classification (Text))
ACL-ARC (Citation Intent Classification (Text))
TVQA (Video Question Answering (Multimodal))
Social-IQ (Social Interaction QA (Multimodal))

Metrics:

Classification Accuracy
Macro-F1
Statistical methodology: Results averaged across 5 random seeds.
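
As a reminder of how the two metrics differ, here is a small pure-Python sketch of accuracy and macro-F1 (the unweighted mean of per-class F1 scores). The function names and the toy labels are illustrative and do not come from the paper's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels (hypothetical).
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here class "a" has F1 = 0.8 and class "b" has F1 = 2/3, so macro-F1 is their mean, while accuracy is 3/4.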

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Text-only domain adaptation results using RoBERTa-base.
ChemProt	Accuracy	82.59	83.75	+1.16
ACL-ARC	Macro-F1	67.76	68.96	+1.20
Multimodal video adaptation results using MERLOT-Reserve.
TVQA	Accuracy	73.20	74.15	+0.95
Social-IQ	Accuracy	70.13	72.50	+2.37

Experiment Figures

Analysis of masking probability vs. TF-ICF score for Difference-Masking compared to Random Masking.

Main Takeaways

Consistently outperforms Random Masking across all 4 datasets (text and video).
Outperforms domain-specific strategies like EntityBERT/SSM without requiring external trained taggers.
Outperforms attention-based strategies (AttnMask, InfoMask) which rely on model internals rather than domain statistics.
Demonstrates that 'what to mask' matters significantly for domain adaptation efficiency.

📚 Prerequisite Knowledge

Prerequisites

Masked Language Modeling (MLM)
TF-IDF (Term Frequency-Inverse Document Frequency)
Self-Supervised Learning (SSL)
Cosine Similarity

Key Terms

Continued Pretraining: Taking a model already trained on a large general corpus and training it further on a smaller, domain-specific corpus.

TF-ICF: Term-Frequency Inverse-Corpus-Frequency, a metric proposed here to score words based on their frequency in the target domain versus a general pretraining corpus.

Difference Anchors: Top-K words identified by TF-ICF that best represent the unique concepts of the target domain.

Anchor Similarity: The cosine similarity between a token's embedding and the embeddings of the identified difference anchors.

MLM: Masked Language Modeling, a training task where the model must predict missing (masked) words in a sentence.