SFT: Supervised Fine-Tuning—training a model on high-quality examples of inputs and desired outputs
DPO: Direct Preference Optimization—an algorithm that aligns a model to human preferences by training directly on chosen/rejected response pairs, without training a separate reward model
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled response relative to the average reward of a group of responses to the same prompt, removing the need for a separate value (critic) model
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a GRPO-style RL algorithm, used here in the production pipeline for reasoning
RLVR: Reinforcement Learning with Verifiable Rewards—training models on tasks where the final answer can be programmatically checked (e.g., math, code)
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer
RAG: Retrieval-Augmented Generation—fetching relevant external data to ground the model's generation
SymPy: A Python library for symbolic mathematics, used here by the Executor agent to perform exact calculations
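DPO's objective can be sketched on a single preference pair. The function name, argument names, and the β = 0.1 default below are illustrative, not from the source:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: push the policy to widen the
    log-probability margin of the chosen response over the rejected one,
    measured relative to a frozen reference model (no reward model)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy already favors the chosen response more strongly than the reference does, the margin is positive and the loss is small; flipping the preference raises the loss.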
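The group-relative comparison at the heart of GRPO can be sketched as a few lines; the helper name is hypothetical:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each response's reward against the
    mean and standard deviation of its group (all responses sampled for
    the same prompt), replacing a learned value/critic model."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]
```

Responses above the group mean get positive advantages and are reinforced; those below are penalized, so only relative quality within the group matters.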
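A toy RLVR-style reward check, assuming an exact-match task; real pipelines use task-specific verifiers (math answer parsers, code test suites) rather than this illustrative string comparison:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """RLVR reward: programmatically check the model's final answer
    instead of learning a reward model. Here: normalized exact match."""
    return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0
```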
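A brief illustration of the kind of exact calculation SymPy enables (the Executor agent's actual calls are not shown in the source):

```python
import sympy as sp

# Symbolic integration with an exact result, no floating-point rounding:
x = sp.symbols("x")
antideriv = sp.integrate(x * sp.exp(x), x)          # (x - 1)*exp(x)
assert sp.simplify(antideriv - (x - 1) * sp.exp(x)) == 0

# Exact rational arithmetic where binary floats would drift:
assert sp.Rational(1, 3) + sp.Rational(1, 6) == sp.Rational(1, 2)
```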