Flash Attention: A memory-efficient, exact attention algorithm that tiles the computation so the full attention matrix is never materialized in slow memory, reducing memory-access overhead and speeding up training and inference on long sequences
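The memory saving comes from an "online softmax" rescaling trick: the softmax-weighted sum can be accumulated in a single streaming pass, so scores never need to be stored all at once. A minimal single-row, scalar-valued sketch of that trick (real Flash Attention applies it block-wise over tiles of the query/key/value matrices; the function name and scalar values are simplifications for illustration):

```python
import math

def attention_row_streaming(scores, values):
    """Softmax-weighted sum over one query row, computed in one pass
    without ever holding the full softmax-ed score vector.
    Illustrative sketch of the online-softmax rescaling idea only."""
    m = float("-inf")   # running maximum of scores seen so far
    denom = 0.0         # running softmax denominator
    acc = 0.0           # running weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        rescale = math.exp(m - m_new)   # math.exp(-inf) == 0.0 on the first step
        denom = denom * rescale + math.exp(s - m_new)
        acc = acc * rescale + math.exp(s - m_new) * v
        m = m_new
    return acc / denom
```

Subtracting the running maximum before exponentiating also keeps the computation numerically stable, which is why the same trick appears in standard softmax implementations.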
Contamination-free packing: A training technique where multiple short documents are concatenated into one sequence to maximize efficiency, but attention is masked so documents do not attend to each other
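The masking can be pictured as a block-diagonal attention mask: each packed document gets its own block, and cross-document positions are disallowed. A small sketch, assuming we already know the token lengths of the packed documents (the function name and boolean-matrix representation are illustrative; real implementations use additive float masks or fused kernels, and decoder models would additionally apply a causal mask within each block):

```python
def packing_attention_mask(doc_lens):
    """Build a block-diagonal mask for documents packed into one sequence.
    mask[i][j] is True iff token i may attend to token j."""
    total = sum(doc_lens)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in doc_lens:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True  # attend only within the same document
        start += n
    return mask
```

For example, packing a 2-token and a 3-token document into one 5-token sequence yields two blocks, and position 0 cannot attend to position 2.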
Knowledge Distillation: A compression technique where a smaller 'student' model learns to mimic the behavior (outputs or internal states) of a larger 'teacher' model
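A common way to make the student mimic the teacher's outputs is to minimize the KL divergence between their temperature-softened output distributions. A minimal sketch of that loss (function names and the temperature value are illustrative; practical setups usually combine this with a standard cross-entropy term on the gold labels):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature softening."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # teacher = target distribution
    q = softmax(student_logits, temperature)  # student = learned distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is what drives the student toward the teacher's behavior.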
MLM: Masked Language Modeling—a pre-training objective where random tokens in the input are hidden, and the model must predict them based on context
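A minimal sketch of the masking step, assuming a 15% masking rate (the token string "[MASK]", the 15% rate, and the function name follow the common BERT-style convention; BERT additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens for MLM pre-training.
    Returns (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere (unmasked positions are not scored)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # the model must predict this original token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the loss
    return masked, labels
```

The model then receives the masked sequence and is trained to predict the original tokens at exactly the masked positions, forcing it to use bidirectional context.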
KLEJ: A comprehensive benchmark for evaluating Polish language understanding models, similar to the English GLUE benchmark
FinBench: A newly introduced suite of 7 Polish-language tasks from the banking and finance domain
SFT: Supervised Fine-Tuning—training a model on labeled data for a specific task