Flash Attention: A memory-efficient, exact attention algorithm that tiles the computation so the full attention matrix is never materialized in slow memory, reducing memory-access overhead and speeding up training and inference on long sequences
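The memory saving comes from an "online softmax" rescaling trick: the softmax-weighted sum can be accumulated in a single streaming pass, so scores never need to be stored all at once. A minimal single-row, scalar-valued sketch of that trick (real Flash Attention applies it block-wise over tiles of the query/key/value matrices; the function name and scalar values are simplifications for illustration):

```python
import math

def attention_row_streaming(scores, values):
    """Softmax-weighted sum over one query row, computed in one pass
    without ever holding the full softmax-ed score vector.
    Illustrative sketch of the online-softmax rescaling idea only."""
    m = float("-inf")   # running maximum of scores seen so far
    denom = 0.0         # running softmax denominator
    acc = 0.0           # running weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        rescale = math.exp(m - m_new)   # math.exp(-inf) == 0.0 on the first step
        denom = denom * rescale + math.exp(s - m_new)
        acc = acc * rescale + math.exp(s - m_new) * v
        m = m_new
    return acc / denom
```

Subtracting the running maximum before exponentiating also keeps the computation numerically stable, which is why the same trick appears in standard softmax implementations.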
Contamination-free packing: A training technique where multiple short documents are concatenated into one sequence to maximize efficiency, but attention is masked so documents do not attend to each other
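The masking can be pictured as a block-diagonal attention mask: each packed document gets its own block, and cross-document positions are disallowed. A small sketch, assuming we already know the token lengths of the packed documents (the function name and boolean-matrix representation are illustrative; real implementations use additive float masks or fused kernels, and decoder models would additionally apply a causal mask within each block):

```python
def packing_attention_mask(doc_lens):
    """Build a block-diagonal mask for documents packed into one sequence.
    mask[i][j] is True iff token i may attend to token j."""
    total = sum(doc_lens)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in doc_lens:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True  # attend only within the same document
        start += n
    return mask
```

For example, packing a 2-token and a 3-token document into one 5-token sequence yields two blocks, and position 0 cannot attend to position 2.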
Knowledge Distillation: A compression technique where a smaller 'student' model learns to mimic the behavior (outputs or internal states) of a larger 'teacher' model
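A common way to make the student mimic the teacher's outputs is to minimize the KL divergence between their temperature-softened output distributions. A minimal sketch of that loss (function names and the temperature value are illustrative; practical setups usually combine this with a standard cross-entropy term on the gold labels):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature softening."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # teacher = target distribution
    q = softmax(student_logits, temperature)  # student = learned distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is what drives the student toward the teacher's behavior.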
MLM: Masked Language Modeling—a pre-training objective where random tokens in the input are hidden, and the model must predict them based on context
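A minimal sketch of the masking step, assuming a 15% masking rate (the token string "[MASK]", the 15% rate, and the function name follow the common BERT-style convention; BERT additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens for MLM pre-training.
    Returns (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere (unmasked positions are not scored)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # the model must predict this original token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the loss
    return masked, labels
```

The model then receives the masked sequence and is trained to predict the original tokens at exactly the masked positions, forcing it to use bidirectional context.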
KLEJ: A comprehensive benchmark for evaluating Polish language understanding models, similar to the English GLUE benchmark
FinBench: A newly introduced suite of 7 Polish-language tasks from the banking and finance domain
SFT: Supervised Fine-Tuning—training a model on labeled data for a specific task