
Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

Remigiusz Kinas, Paweł Kiszczak, Sergio P. Perez, Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej
NVIDIA, Bielik.AI, Jagiellonian University
arXiv (2026)

📝 Paper Summary

Topics: Model Compression · Large Language Models (LLMs)
Bielik-Minitron-7B compresses an 11B Polish language model to 7B via hybrid structured pruning and logit-based distillation, maintaining linguistic competence while significantly reducing deployment costs.
Core Problem
Deploying high-performance Large Language Models (LLMs) for specific European languages like Polish requires excessive computational resources (VRAM), while training smaller models from scratch is prohibitively expensive.
Why it matters:
  • High-performance reasoning models usually exceed the memory capacity of consumer-grade hardware (e.g., NVIDIA RTX 4090), limiting local adoption
  • Training language-specific models from scratch has a massive carbon footprint and financial cost compared to compressing existing flagship models
  • Current English-centric compression research often neglects the morphological complexity of languages like Polish, necessitating tailored pruning strategies
Concrete Example: A standard 11B-parameter model requires enterprise-grade GPUs to run efficiently. Compressing it to 7B makes it deployable on consumer hardware, but naive pruning destroys its ability to handle Polish grammar (inflections, cases) unless the pruned model is re-aligned with the teacher via distillation.
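The VRAM argument can be made concrete with back-of-the-envelope arithmetic (a rough sketch; real deployments also need memory for the KV cache, activations, and framework overhead, so actual headroom is smaller than shown):

```python
# Rough weight footprint at fp16/bf16 precision (2 bytes per parameter).
# KV cache and activations add more on top of this figure.
def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / 1e9

print(f"11.04B teacher: {weight_gb(11.04e9):.2f} GB")  # weights alone nearly fill a 24 GB RTX 4090
print(f"7.35B student:  {weight_gb(7.35e9):.2f} GB")   # leaves ~9 GB of headroom for KV cache
```

At fp16, the 11B teacher's weights alone consume roughly 22 GB of a 24 GB consumer card, leaving almost nothing for the KV cache; the 7B student fits with room to spare.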
Key Novelty
Two-Stage Compression with Hybrid Pruning and Alignment
  • Applies NVIDIA's Minitron approach to prune along four axes simultaneously (depth, width, attention heads, MLP size) rather than just one, preserving the most critical circuits for Polish reasoning
  • Combines activation-based importance estimation (pruning weights that activate weakly) with logit-only knowledge distillation to transfer the teacher's probability distribution to the student
  • Integrates a full post-pruning alignment pipeline (SFT, DPO, GRPO) to recover instruction-following capabilities lost during the compression phase
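The two core mechanisms above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the importance score here is mean absolute activation over a calibration batch (the paper's Minitron-style estimation operates per axis across depth, width, heads, and MLP size), and the loss is standard temperature-scaled forward KL on logits:

```python
import numpy as np

def neuron_importance(activations: np.ndarray) -> np.ndarray:
    """Mean absolute activation per neuron over a calibration batch.
    activations: (batch, hidden) for one MLP layer."""
    return np.abs(activations).mean(axis=0)

def prune_mask(importance: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Boolean mask keeping the top-k most important neurons."""
    k = int(len(importance) * keep_ratio)
    keep = np.argsort(importance)[-k:]
    mask = np.zeros(len(importance), dtype=bool)
    mask[keep] = True
    return mask

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def kd_loss(teacher_logits, student_logits, temperature: float = 2.0) -> float:
    """Logit-only distillation: KL(teacher || student) on
    temperature-softened distributions, scaled by T^2 as in Hinton et al."""
    t = log_softmax(np.asarray(teacher_logits) / temperature)
    s = log_softmax(np.asarray(student_logits) / temperature)
    return float((np.exp(t) * (t - s)).sum(axis=-1).mean() * temperature**2)
```

Because the loss uses only the teacher's output distribution, the student needs no access to teacher hidden states, which is what lets the recipe work with under 3% of the original pre-training data.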
Evaluation Highlights
  • Reduced model size by 33.4% (from 11.04B to 7.35B parameters) while recovering ~90% of the baseline model's performance
  • Achieved up to 50% inference speedup compared to the original Bielik-11B-v3.0 teacher model
  • Demonstrates that logit-only distillation can successfully recover linguistic fidelity for morphologically rich languages like Polish using <3% of original pre-training data
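The headline numbers are internally consistent; a quick check (the speedup estimate is my own rough approximation, assuming decoding is memory-bandwidth-bound so latency scales with parameter count):

```python
teacher, student = 11.04e9, 7.35e9

reduction = 1 - student / teacher
print(f"size reduction: {reduction:.1%}")   # matches the reported 33.4%

# Rough estimate: memory-bound decoding speed scales ~1/params,
# consistent with the reported "up to 50%" inference speedup.
speedup = teacher / student
print(f"throughput ratio: {speedup:.2f}x")
```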
Breakthrough Assessment
7/10
Solid application of the Minitron framework to a new linguistic domain (Polish). While the core methodology is adapted from NVIDIA, the integration of GRPO and the specific focus on under-represented languages adds practical value.