PretrainZero: Reinforcement Active Pretraining

📝 Paper Summary

Reinforcement Learning for Pretraining
Active Learning

PretrainZero applies reinforcement learning directly to general pretraining data by actively generating and predicting masked spans, enabling reasoning capabilities to emerge without supervised fine-tuning or external reward models.

Core Problem

Applying reinforcement learning (RL) to general pretraining is difficult: existing methods either rely on verifiable rewards (such as math answers), which are scarce in general corpora, or use inefficient fixed masking strategies.

Why it matters:

Post-training RL (like RLHF) hits a 'data-wall' due to reliance on expensive human labels or domain-specific verifiers
Existing reinforcement pretraining attempts fail on real-world noisy data because random or entropy-based token selection creates unstable or trivial learning signals
Standard self-supervised learning (next-token prediction) captures patterns but does not explicitly incentivize the reasoning chains seen in expert models

Concrete Example: When using entropy-based masking on noisy Wikipedia data, the model might select a high-entropy token that is simply noise or formatting rather than a reasoning target. This causes 'training collapse' where rewards degrade to zero because the target is unpredictable, unlike in clean synthetic datasets where high entropy correlates with difficulty.
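
To make this failure mode concrete, the following is a minimal sketch of entropy-based target selection on toy data. The distributions are made-up illustrative numbers rather than values from the paper, and `shannon_entropy` and `pick_max_entropy_position` are hypothetical helper names. The point is only that a max-entropy rule cannot tell a hard but learnable token apart from pure noise.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_max_entropy_position(dists):
    """Return the index of the position whose predictive distribution has the highest entropy."""
    entropies = [shannon_entropy(d) for d in dists]
    return max(range(len(dists)), key=lambda i: entropies[i])

# Toy predictive distributions over a 4-token vocabulary (illustrative numbers only).
toy_dists = [
    [0.97, 0.01, 0.01, 0.01],  # easy, near-deterministic token
    [0.60, 0.30, 0.05, 0.05],  # moderately uncertain token: a plausible reasoning target
    [0.25, 0.25, 0.25, 0.25],  # noise or formatting residue: uniform, hence unpredictable
]

# The rule selects the uniform (noise-like) position, not the learnable one.
selected = pick_max_entropy_position(toy_dists)
```

In this toy setup the selected position is the uniform one, whose exact-match reward would stay near zero no matter how well the model reasons.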

Key Novelty

Reinforcement Active Pretraining

Mimics human active learning by training a 'mask generator' policy to actively find informative, learnable spans within the data, rather than using random masking
Simultaneously trains a 'reasoner' policy to recover these masked spans via Chain-of-Thought, using exact-match with the original text as a ground-truth verifier
Formulates this as a min-max game where the generator tries to create challenging masks (lowering prediction accuracy) while the reasoner tries to solve them

Architecture

The Reinforcement Active Pretraining framework showing the interaction between Mask Generation and Mask Prediction.

Evaluation Highlights

+10.60 average improvement on math benchmarks for Qwen3-4B-Base after reinforcement pretraining, without any SFT data
+8.43 improvement on MMLU-Pro for Qwen3-4B-Base, demonstrating generalization to complex reasoning beyond simple completion
+3.04 improvement on SuperGPQA after post-training, showing that the pretrained reasoning capabilities transfer to downstream RLVR tasks

Breakthrough Assessment

8/10

Significant step in removing the reliance on SFT and external reward models for reasoning. Successfully applies RLVR-style learning to raw pretraining data via active learning, showing strong empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised reinforcement learning on general text corpus (Wikipedia)

Inputs: Raw text sequence s from pretraining distribution D

Outputs: Predicted tokens for a masked span x_hat, generated via Chain-of-Thought

Pipeline Flow

Mask Generation Policy (selects span m given sequence s)
Mask Prediction Policy (generates CoT and predicts span x_hat given s with mask m)
Verifier (compares x_hat with ground truth span from s)
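
The three stages above can be sketched as a single rollout function. This is an illustrative sketch under several assumptions: the `[MASK]` placeholder, whitespace tokenization, whitespace-normalized comparison, and the stand-in `generate_mask` / `predict_span` callables are all hypothetical, and the Chain-of-Thought text that the real predictor would emit before its answer is omitted.

```python
from typing import Callable, List, Tuple

MASK_TOKEN = "[MASK]"  # assumed placeholder; the paper's actual mask format may differ

def apply_mask(tokens: List[str], span: Tuple[int, int]) -> Tuple[List[str], List[str]]:
    """Replace tokens[start:end] with one placeholder; return (masked tokens, ground-truth span)."""
    start, end = span
    return tokens[:start] + [MASK_TOKEN] + tokens[end:], tokens[start:end]

def exact_match_reward(predicted: str, truth: List[str]) -> float:
    """Verifier: 1.0 if the prediction equals the original span, else 0.0.
    Comparing whitespace-split tokens is an assumption of this sketch."""
    return 1.0 if predicted.split() == truth else 0.0

def rollout(
    tokens: List[str],
    generate_mask: Callable[[List[str]], Tuple[int, int]],
    predict_span: Callable[[List[str]], str],
) -> float:
    """One pass: mask generation -> mask prediction -> verification."""
    span = generate_mask(tokens)              # mask generation policy picks m
    masked, truth = apply_mask(tokens, span)  # build the masked input and keep the ground truth
    x_hat = predict_span(masked)              # mask prediction policy outputs x_hat
    return exact_match_reward(x_hat, truth)   # verifier compares x_hat with the original span

# Stand-in policies for illustration only; in the paper both roles are played by one LLM.
tokens = "Paris is the capital of France".split()
fixed_generator = lambda toks: (5, 6)         # always masks the final token
good_predictor = lambda masked: "France"
bad_predictor = lambda masked: "Spain"

reward_good = rollout(tokens, fixed_generator, good_predictor)
reward_bad = rollout(tokens, fixed_generator, bad_predictor)
```

Because the ground truth comes from the text itself, no external reward model or human label is needed to score a rollout.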

System Modules

Mask Generator

Actively selects a span of text to mask that is verifiable and informative (challenging but solvable)

Model or implementation: Shared Transformer Base (e.g., Qwen3-4B-Base)

Mask Predictor

Recovers the masked span by generating a reasoning chain followed by the token prediction

Model or implementation: Shared Transformer Base (e.g., Qwen3-4B-Base)

Novel Architectural Elements

Bilevel optimization framework where a single shared LLM acts as both problem creator (masker) and solver (predictor) within the same batch
Integration of active learning objectives directly into the RL pretraining loop via the min-max formulation

Modeling

Base Model: Qwen3-4B-Base, Qwen3-8B-Base, Qwen3-30B-A3B-MoE-Base, SmolLM3-3B-Base

Training Method: GRPO (Group Relative Policy Optimization)
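
As a rough illustration of the group-relative step in GRPO, the sketch below normalizes rewards within one group of rollouts sampled for the same prompt. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions of this sketch; the exact GRPO variant and group size used in the paper are not reported in this summary.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group: (r - mean) / (std + eps).

    No learned critic is involved; the group mean acts as the baseline.
    Population standard deviation is an assumption of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same masked input, scored by exact match.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout fails (or every rollout succeeds), all advantages are zero,
# so that group contributes no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The all-zero case is why unpredictable masks are harmful: a group whose rewards are identical yields no gradient signal for the predictor.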

Objective Functions:

Purpose: Maximize reasoner performance on recovering masked spans.
Formally: Maximize E[R(s, m, x_hat)], where R is the exact-match reward.

Purpose: Minimize reasoner performance (to find harder masks) while avoiding impossible noise.
Formally: Minimize E[R(s, m, x_hat)] via the generator policy.

Purpose: Joint min-max optimization.
Formally: min_omega' max_omega J(omega, omega'), where omega (predictor role) and omega' (generator role) refer to the parameters of a single shared model.
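
One way to picture the opposing objectives is through the rewards each role receives for a single mask. The sketch below is an assumed form, not the paper's exact definition: the predictor gets its exact-match reward, while the generator gets one minus the group accuracy, with zero reward when no rollout solves the mask (following the zero-reward rule for unsolved masks noted under Limitations). Both function names are hypothetical.

```python
def predictor_rewards(matches):
    """Each predictor rollout receives its own exact-match reward (1.0 or 0.0)."""
    return [1.0 if m else 0.0 for m in matches]

def generator_reward(matches):
    """Assumed generator reward for one mask, given exact-match outcomes of a group of rollouts.

    Lower predictor accuracy means a harder mask and a higher reward, but a mask
    that no rollout solves gets zero, so noisy or unsolvable spans are not encouraged.
    """
    if not any(matches):
        return 0.0
    accuracy = sum(matches) / len(matches)
    return 1.0 - accuracy

hard_but_solvable = generator_reward([True, False, False, False])
unsolvable = generator_reward([False, False, False, False])
trivial = generator_reward([True, True, True, True])
```

Under this assumed form, both trivial masks and impossible masks earn the generator nothing, which pushes it toward spans that are challenging yet recoverable.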

Adaptation: Full-parameter updates, starting from the base model

Trainable Parameters: All parameters (shared between generator and predictor)

Training Data:

General Wikipedia corpus (no synthetic CoT or QA pairs)

Key Hyperparameters:

group_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Comparison to Prior Work

vs. RPT: Uses active mask generation instead of fixed heuristics; trains on noisy real-world Wikipedia instead of clean synthetic OmniMath
vs. Self-Play RL: Uses ground-truth text spans as 'verifiable' rewards instead of model consensus/voting, avoiding hallucination loops
vs. Masked Language Modeling (BERT/SpanBERT) [not cited in paper]: Adds RL and Chain-of-Thought generation to the reconstruction task rather than simple likelihood maximization

Limitations

Higher computational overhead than standard next-token prediction (NTP), since masks and reasoning chains are generated for every training sample
Performance gains are currently demonstrated primarily on reasoning/math benchmarks; the impact on other capabilities is less explored
Requires careful balancing of the min-max game to prevent the generator from creating impossible masks (mitigated by giving zero reward for unsolved masks, but the balance remains sensitive)

Reproducibility

Method is described in detail, including the min-max objective and reward definitions. Base models (Qwen3) and dataset (Wikipedia) are public. Code availability is not stated.

📊 Experiments & Results

Evaluation Setup

Pretraining on Wikipedia followed by optional post-training RLVR

Benchmarks:

MMLU-Pro (General Reasoning / Knowledge)
SuperGPQA (Graduate-Level Reasoning)
Math Average (Mathematical Reasoning)

Metrics:

Accuracy (assumed; the metric is not explicitly named but is implied by the reported scores)
Statistical methodology: Not explicitly reported in the paper

Key Results

Results showing improvement of PretrainZero (Reinforcement Pretraining stage only) over the vanilla Base model.
Benchmark	Metric	Baseline	This Paper	Δ
MMLU-Pro	Score	Not explicitly reported in the paper	Not explicitly reported in the paper	+8.43
SuperGPQA	Score	Not explicitly reported in the paper	Not explicitly reported in the paper	+5.96
Math Average	Score	Not explicitly reported in the paper	Not explicitly reported in the paper	+10.60

Results after applying general RLVR Post-Training, comparing the PretrainZero initialization vs. standard Base initialization.
Benchmark	Metric	Baseline	This Paper	Δ
MMLU-Pro	Score	Not explicitly reported in the paper	Not explicitly reported in the paper	+2.35
SuperGPQA	Score	Not explicitly reported in the paper	Not explicitly reported in the paper	+3.04
Math Average	Score	Not explicitly reported in the paper	Not explicitly reported in the paper	+2.81

Experiment Figures

Comparison of training dynamics (Reward vs Steps) for Random NTP, Random Mask, and Entropy NTP on Wikipedia.

Performance deltas on benchmarks (MMLU-Pro, SuperGPQA, Math) for PretrainZero vs Base model, both before and after Post-Training.

Main Takeaways

Active masking (PretrainZero) significantly outperforms random and entropy-based masking on real-world noisy data (Wikipedia), where entropy-based methods lead to collapse.
PretrainZero enables the emergence of reasoning capabilities (Chain-of-Thought) directly from base models without SFT cold-start or reward models.
Improvements gained during reinforcement pretraining persist and stack with improvements from standard post-training RLVR.
The method scales effectively to real-world pretraining distributions, overcoming the limitation of prior RLPT work that relied on synthetic data.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) basics
Language Model Pretraining (Masked/Next Token Prediction)
Chain-of-Thought (CoT) reasoning

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL where the reward signal comes from an objective, checkable outcome (like a correct math answer) rather than a neural reward model

RLPT: Reinforcement Learning Pre-Training—applying RL algorithms during the pretraining phase rather than just post-training

SFT: Supervised Fine-Tuning—training on high-quality labeled data (instructions/responses) usually required before RL; this paper eliminates it

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples to stabilize training without a critic model

CoT: Chain-of-Thought—a reasoning strategy where the model generates intermediate steps before the final answer

Active Learning: A machine learning paradigm where the algorithm actively selects which data points to learn from, typically choosing the most informative or uncertain ones