Structured Pruning: Removing entire architectural components (layers, heads, neurons) rather than individual weights, resulting in a smaller dense model that runs faster on standard hardware
Knowledge Distillation: A training process where a small 'student' model learns to mimic the output distribution of a larger 'teacher' model (the probabilities obtained by applying softmax to the teacher's logits)
Logits: The raw, unnormalized prediction scores generated by a neural network before applying the softmax function
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled response relative to a group of responses to the same prompt, removing the need for a separate value model; used to align the model's behavior with human preferences
DPO: Direct Preference Optimization—a stable method for fine-tuning LLMs on preference pairs (better/worse outputs) without a separate reward model
Block Influence: A metric used to determine which Transformer layers can be removed; layers that transform the input the least (high cosine similarity between input/output) are pruned
SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs
KL Divergence: Kullback-Leibler Divergence—a statistical measure quantifying how much one probability distribution (student) differs from another (teacher)
H200: NVIDIA H200 Tensor Core GPU—high-performance hardware used for the training/distillation process in this paper
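To make the distillation-related entries concrete, here is a minimal sketch of how a student's output distribution can be compared against a teacher's using KL divergence over temperature-softened softmax probabilities. The logits, temperature value, and vocabulary size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature softens the distribution."""
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how much the student distribution q diverges from the teacher distribution p."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical logits over a tiny 4-token vocabulary (illustrative only).
teacher_logits = np.array([2.0, 1.0, 0.5, -1.0])
student_logits = np.array([1.5, 1.2, 0.3, -0.5])

T = 2.0  # assumed distillation temperature
p_teacher = softmax(teacher_logits, T)
q_student = softmax(student_logits, T)

distill_loss = kl_divergence(p_teacher, q_student)
```

In practice this KL term is the distillation objective the student minimizes, often combined with a standard cross-entropy loss on the ground-truth labels.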
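The Block Influence entry can likewise be sketched in a few lines: a layer whose output hidden state is nearly identical to its input (cosine similarity near 1) has low influence and is a pruning candidate. The function name and the toy vectors below are assumptions for illustration.

```python
import numpy as np

def block_influence(hidden_in, hidden_out, eps=1e-12):
    """1 - cosine similarity between a layer's input and output hidden states.
    Values near 0 mean the layer barely transforms its input (prune candidate)."""
    cos = np.dot(hidden_in, hidden_out) / (
        np.linalg.norm(hidden_in) * np.linalg.norm(hidden_out) + eps
    )
    return 1.0 - cos

# A near-identity layer: output almost equals input -> influence near 0.
x = np.array([1.0, 2.0, 3.0])
low_influence = block_influence(x, x * 1.001)

# A transformative layer: orthogonal output -> influence near 1.
high_influence = block_influence(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Ranking all Transformer layers by this score and dropping the lowest-scoring ones yields the structured pruning described above.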