
It's All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers

B Clavié, N Cooper, B Warner
Answer.AI
arXiv, February 2025
Pretraining · Reasoning · Benchmark

📝 Paper Summary

Encoder-only Language Models · Instruction Tuning · Zero-shot Classification
ModernBERT-Large-Instruct repurposes the masked language modeling head of a modern encoder for generative classification, achieving strong zero-shot performance with minimal engineering or architectural changes.
Core Problem
Encoder-only models like BERT rely on task-specific classification heads that require fine-tuning and struggle with zero-shot tasks compared to decoder-based LLMs.
Why it matters:
  • Encoder models are significantly cheaper and faster for inference than LLMs but lag in flexibility and zero-shot capabilities.
  • Existing methods to make encoders generative often require heavy overhead like complex prompting, converting to autoregressive modes, or architectural tweaks (e.g., custom attention masks).
  • Industry still relies heavily on older encoders (BERT, RoBERTa) which lack the benefits of modern data mixes and architectures found in recent LLMs.
Concrete Example: To classify sentence sentiment zero-shot, a standard BERT needs a task-specific head trained on labeled data. In contrast, ModernBERT-Large-Instruct appends 'Sentiment: [MASK]' to the input and predicts 'Positive' or 'Negative' at the mask position using its pre-trained MLM head, much as an LLM would generate a response.
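The cloze mechanism can be sketched as: score each candidate label's token at the [MASK] position and take the argmax. The logits below are mocked in place of a real forward pass, and the verbalizer token ids are illustrative, not a real tokenizer's.

```python
import numpy as np

def classify_cloze(mask_logits: np.ndarray, verbalizer: dict) -> str:
    """Pick the label whose (single-token) verbalizer word scores highest
    at the [MASK] position, using the MLM head's vocabulary logits."""
    scores = {label: mask_logits[tok_id] for label, tok_id in verbalizer.items()}
    return max(scores, key=scores.get)

# Toy vocabulary ids for the label words (illustrative only).
verbalizer = {"Positive": 101, "Negative": 202}

# Mocked MLM-head logits standing in for a forward pass over
# "I loved this movie. Sentiment: [MASK]".
rng = np.random.default_rng(0)
logits = rng.normal(size=30_000)
logits[101] = 10.0  # pretend the model strongly prefers "Positive"

print(classify_cloze(logits, verbalizer))  # prints "Positive"
```

Because the answer must fit in a single mask slot, the approach is non-autoregressive: one forward pass yields the classification, which is what keeps inference cheap relative to decoder LLMs.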
Key Novelty
Generative Masked Language Modeling for Instruction Following
  • Treats classification as a Cloze-style fill-in-the-blank task where the model generates a single token answer via the Masked Language Modeling (MLM) head.
  • Uses a simplified instruction-tuning process on the FLAN dataset, filtering for single-token answers to align with the non-autoregressive nature of encoders.
  • Discovers that mixing in 'dummy' MLM examples (standard masking with degenerate labels) acts as a regularizer, significantly boosting performance.
Evaluation Highlights
  • Outperforms similarly sized LLMs (SmolLM2-360M) and encoder baselines on MMLU, achieving 93% of Llama3-1B’s performance with 60% fewer parameters.
  • Surpasses traditional classification-head fine-tuning methods on 2 out of 3 zero-shot tasks (ADE, One Stop English).
  • Matches or exceeds fully fine-tuned classification heads on diverse NLU tasks (news subject, entailment, emotion detection) when using the generative MLM head approach.
Breakthrough Assessment
7/10
Strong proof-of-concept that modern encoders can be effective generative classifiers without complex architectural changes, challenging the assumption that only decoders excel at instruction following.