
Long-Context Encoder Models for Polish Language Understanding

Sławomir Dadas, Rafał Poświata, Marek Kozłowski, Małgorzata Grębowiec, Michał Perełkiewicz, Paweł Klimiuk, Przemysław Boruta
arXiv (2026)
Topics: Pretraining, Benchmark, Memory

📝 Paper Summary

Keywords: Encoder-only Language Models, Long-context Modeling
The paper introduces a Polish RoBERTa encoder adapted for 8192-token contexts via two-stage training and packing optimizations, achieving state-of-the-art performance on long-document financial tasks.
Core Problem
Standard Polish encoders like BERT and HerBERT are limited to a 512-token context window, which is insufficient for processing long documents like banking contracts or regulations.
Why it matters:
  • Existing long-context solutions rely on generative decoder LLMs, which are computationally expensive and less parameter-efficient for discriminative tasks than encoders.
  • Truncating long documents to fit 512 tokens causes information loss, particularly for tasks where critical information appears late in the text.
Concrete Example: In the 'Banking-Long' classification task, which involves full articles running to several thousand tokens, a standard 512-token model must truncate the text, potentially missing key thematic indicators in the middle or at the end of the document, whereas the proposed model processes the full context.
Key Novelty
polish-roberta-8k with Two-Stage Context Adaptation
  • Adapts an existing pre-trained RoBERTa model to 8k context by first training only the new positional embeddings (to prevent gradient shock) and then fine-tuning all parameters
  • Implements contamination-free packing and Flash Attention to make training on long sequences computationally feasible and efficient
  • Introduces 'FinBench', a new benchmark suite for Polish financial and banking tasks, including long-document classification
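The two-stage adaptation above hinges on extending the learned position-embedding table from 512 to 8192 rows before any training starts. A minimal sketch of one plausible initialization (cyclic tiling of the original table; the paper's exact scheme may differ) is shown below, with the two training stages described in comments:

```python
import numpy as np

def extend_position_embeddings(old_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Extend a (old_len, dim) position-embedding table to new_len rows.

    The first old_len rows are copied verbatim, so the model's behaviour
    on short inputs is preserved. The remaining rows are initialized by
    cyclically tiling the original table -- one of several plausible
    schemes; this is an illustrative assumption, not the paper's method.
    """
    old_len, dim = old_emb.shape
    new_emb = np.empty((new_len, dim), dtype=old_emb.dtype)
    for i in range(new_len):
        new_emb[i] = old_emb[i % old_len]
    return new_emb

# Stage 1: train only the new embedding table (all other weights frozen)
# to avoid the "gradient shock" of untrained positions.
# Stage 2: unfreeze the full model and fine-tune all parameters jointly.
old = np.random.randn(512, 768).astype(np.float32)
new = extend_position_embeddings(old, 8192)
```

In a PyTorch-style implementation, stage 1 would correspond to setting `requires_grad = False` on every parameter except the new position-embedding weights.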
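"Contamination-free packing" can be understood as packing several short documents into one long training sequence while masking attention so that tokens never attend across document boundaries. The sketch below builds such a block-diagonal attention mask from per-document lengths; it illustrates only the mask semantics (the actual training loop would pair this with an efficient kernel such as Flash Attention, and the paper's implementation may differ):

```python
import numpy as np

def block_diagonal_mask(doc_lengths):
    """Build an attention mask for a packed sequence of several documents.

    mask[i, j] is True when token i may attend to token j, which holds
    only when both tokens belong to the same document. This keeps packed
    documents from contaminating each other's representations.
    (Illustrative sketch, not the paper's exact implementation.)
    """
    total = sum(doc_lengths)
    segment = np.empty(total, dtype=np.int64)
    start = 0
    for doc_id, length in enumerate(doc_lengths):
        segment[start:start + length] = doc_id
        start += length
    return segment[:, None] == segment[None, :]

# Two documents of lengths 3 and 2 packed into one 5-token sequence:
mask = block_diagonal_mask([3, 2])
```

Packing this way keeps every position of an 8192-token window filled with real text, so no compute is wasted on padding.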
Evaluation Highlights
  • Achieves +8 percentage points improvement over HerBERT on internal Banking Emails classification by combining extended context with domain adaptation
  • Distilled 6-layer model achieves 115% of the throughput of base-sized models while maintaining comparable quality on most short-text tasks
  • Outperforms competitive Polish models on long-context tasks (Banking-Long, EURLEX, MIPD) while maintaining quality on the standard KLEJ benchmark
Breakthrough Assessment
7/10
Significant engineering contribution for low-resource languages (Polish). While the architecture is standard RoBERTa, the successful context extension and rigorous domain-specific evaluation (FinBench) provide high practical value.