
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli
Answer.AI, LightOn, Johns Hopkins University, NVIDIA, HuggingFace
arXiv.org (2024)

📝 Paper Summary

Encoder-only Language Models · Long-context Transformers · Efficient Inference
ModernBERT modernizes the BERT architecture by combining a native 8192-token context window, alternating global/local attention, and hardware-aware optimizations to achieve state-of-the-art efficiency and performance.
Core Problem
Encoder-only models are vital for retrieval and classification but rely on the outdated BERT architecture, which is limited to short contexts (512 tokens), inefficient on modern GPUs, and trained on stale data.
Why it matters:
  • RAG (Retrieval-Augmented Generation) pipelines currently rely on older encoders with limited context, forcing suboptimal chunking of documents
  • Existing modernization attempts (MosaicBERT, NomicBERT) either lack efficiency, fail to extend context length sufficiently, or use outdated data mixtures missing code and recent events
  • Practitioners need efficient discriminative models that match LLM performance on specific tasks without the massive computational cost of decoder-only models
Concrete Example: A standard BERT model processing a 2000-token legal document must either truncate it to 512 tokens, losing critical information, or fall back on inefficient sliding windows. ModernBERT processes the full 8192-token context natively, and does so faster.
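The truncation-versus-chunking trade-off above is easy to see in a minimal sketch. The helper below is illustrative (not from the paper): it splits a tokenized document into windows that fit a given context length, showing that a 512-token encoder needs four passes over a 2000-token document while an 8192-token encoder handles it in one.

```python
def chunk_tokens(token_ids, max_len, stride=None):
    """Split a token sequence into windows that fit an encoder's context.

    stride defaults to max_len (non-overlapping chunks); a smaller stride
    would give overlapping sliding windows.
    """
    stride = stride or max_len
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), stride)]

doc = list(range(2000))                 # stand-in for a 2000-token document
chunks_512 = chunk_tokens(doc, 512)     # 4 chunks for a 512-token encoder
chunks_8192 = chunk_tokens(doc, 8192)   # 1 pass for an 8192-token encoder
```

Each extra chunk means another forward pass and another embedding to store and reconcile downstream, which is exactly the overhead a native long-context encoder removes.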
Key Novelty
Hardware-Aware Modernized Encoder Architecture
  • Replaces absolute positions with Rotary Positional Embeddings (RoPE) and uses alternating global/local attention layers to handle 8192-token sequences efficiently
  • Optimizes for GPU inference by removing padding (Unpadding) and using 'Deep & Narrow' layer configurations that align with GPU tensor core tiling
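The alternating-attention idea can be sketched as a per-layer mask schedule. The paper alternates full global attention (every third layer) with a 128-token local sliding window in the remaining layers; the exact band shape below (a symmetric window centered on each query) and the function names are my assumptions for illustration.

```python
import numpy as np

GLOBAL_EVERY = 3     # per the paper: every third layer attends globally
LOCAL_WINDOW = 128   # sliding-window size for the remaining layers

def is_global_layer(layer_idx: int) -> bool:
    """True for layers that use full (global) bidirectional attention."""
    return layer_idx % GLOBAL_EVERY == 0

def attention_mask(seq_len: int, layer_idx: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True if query i may attend to key j."""
    if is_global_layer(layer_idx):
        # Global layer: every token attends to every other token.
        return np.ones((seq_len, seq_len), dtype=bool)
    # Local layer: attend only within a +/- LOCAL_WINDOW//2 band
    # (centering the window is an assumption about the exact band shape).
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= LOCAL_WINDOW // 2
```

Because local layers cost O(seq_len × window) rather than O(seq_len²), interleaving them with occasional global layers keeps long-range information flow while making most of the depth cheap at 8192 tokens.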
Evaluation Highlights
  • Native context length of 8192 tokens (vs. 512 in original BERT), enabling processing of long documents without truncation
  • Trained on 2 trillion tokens of code, web, and scientific data (vs. the roughly 3.3 billion words of books and Wikipedia used for the original BERT), significantly updating the model's knowledge base
  • Processes 8192-token sequences almost 2x faster than previous encoder models due to architectural optimizations like unpadding and alternating attention
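Unpadding, one of the optimizations credited above, can be illustrated with a small sketch: instead of running attention over padded rows, the real tokens of a batch are concatenated into one flat sequence plus cumulative lengths, the layout consumed by variable-length FlashAttention kernels. The function and variable names here are illustrative, not the paper's implementation.

```python
import numpy as np

def unpad(batch_ids: np.ndarray, attention_mask: np.ndarray):
    """Concatenate the real tokens of a padded batch into one flat array.

    Returns the flat token ids and cumulative sequence boundaries
    (cu_seqlens), so no compute is spent on padding positions.
    """
    flat = batch_ids[attention_mask.astype(bool)]        # drop pad tokens
    seqlens = attention_mask.sum(axis=1)                 # real length per row
    cu_seqlens = np.concatenate([[0], np.cumsum(seqlens)])
    return flat, cu_seqlens

batch = np.array([[5, 6, 7, 0, 0],
                  [8, 9, 0, 0, 0]])      # 0 = padding token
mask = (batch != 0).astype(np.int64)
flat, cu = unpad(batch, mask)
# flat is [5, 6, 7, 8, 9]; cu is [0, 3, 5], marking sequence boundaries
```

With mixed-length batches, padding can dominate the token count, so skipping it entirely (rather than masking it out after the fact) is a large share of the speedup on real workloads.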
Breakthrough Assessment
8/10
A significant infrastructure update for the NLP community. While not a new paradigm, it fixes the long-standing neglect of encoder-only models, likely becoming the new default backbone for RAG and classification.