RLP: Reinforcement as a Pretraining Objective

📝 Paper Summary

LLM Pretraining Objectives, Reasoning Capabilities, Chain-of-Thought (CoT)

RLP integrates reasoning into the pretraining phase by rewarding the model for generating internal thoughts that improve the prediction of the next token compared to a no-thought baseline.

Core Problem

Standard next-token prediction pretraining forces models to reason linearly and implicitly, failing to encourage explicit, multi-step thinking or integration with world knowledge before output generation.

Why it matters:

Current reasoning abilities are typically induced only during post-training (SFT, RLHF), limiting the model's fundamental grounding in logic during its primary learning phase.
Human comprehension is non-linear and integrates priors in parallel; standard pretraining lacks mechanisms to mimic this, resulting in models that struggle with complex reasoning without heavy fine-tuning.

Concrete Example: In standard training, a model predicts the answer to a math problem token-by-token immediately. If the problem requires intermediate steps (e.g., '15 * 12'), the model might guess '180' by rote memorization or fail. RLP forces it to generate '10*15=150, 2*15=30, 150+30=' as a thought before predicting '180', rewarding this thought if it makes '180' more probable.

Key Novelty

Reinforcement Learning Pre-training (RLP)

Treats generating a Chain-of-Thought (CoT) as a latent action taken before predicting the next token, rewarding thoughts that increase the likelihood of the correct next token.
Uses a 'no-think' baseline (an Exponential Moving Average of the model) to measure information gain, creating a dense, verifier-free reward signal applicable to any text document.
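
To make the reward concrete, here is a minimal sketch of the information-gain signal for the '15 * 12' example. The probabilities are made-up illustrative numbers, and the function name `information_gain_reward` is hypothetical rather than taken from the paper.

```python
import math

def information_gain_reward(p_with_thought: float, p_no_think: float) -> float:
    """Log-likelihood gain on the true next token from conditioning on a thought.

    p_with_thought: probability of the true token given context + sampled thought.
    p_no_think:     probability of the true token from the no-think (EMA) baseline,
                    which sees the context only.
    """
    return math.log(p_with_thought) - math.log(p_no_think)

# Illustrative (made-up) probabilities for the true next token '180'.
helpful = information_gain_reward(p_with_thought=0.90, p_no_think=0.30)
unhelpful = information_gain_reward(p_with_thought=0.10, p_no_think=0.30)

print(f"helpful thought reward:   {helpful:+.3f}")
print(f"unhelpful thought reward: {unhelpful:+.3f}")
```

A thought that raises the probability of the true token earns a positive reward, and one that lowers it earns a negative reward, so no external verifier is needed to score it.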

Architecture

Illustration of the RLP training process compared to standard Next-Token Prediction.

Evaluation Highlights

+19% relative improvement in the average over 8 math/science benchmarks (30.32 to 36.03) for qwen3-1.7b-base pretrained with RLP compared to the standard base model.
+35% relative improvement for Nemotron-Nano-12B-v2 on overall benchmarks using only 0.125% of the data compared to a heavily trained baseline.
Gains persist after strong post-training (SFT + RLVR), with the RLP model outscoring the continuously pretrained baseline by about 7% in relative terms on average.

Breakthrough Assessment

9/10

Moves reasoning training from post-training to pretraining using a scalable, verifier-free objective. Demonstrates gains on base models (+5.71 average points on qwen3-1.7b-base, +18.51 on Nemotron-Nano-12B-v2) that persist after identical post-training.

⚙️ Technical Details

Problem Definition

Setting: LLM Pretraining on general text corpora augmented with latent reasoning steps

Inputs: Text sequence x_<t

Outputs: Next token x_t, conditioned on a sampled thought c_t

Pipeline Flow

Thought Policy (samples latent thought c_t given context)
Reasoned Predictor (predicts x_t given context + thought)
No-Think Baseline (predicts x_t given context only via EMA teacher)
Reward Computation (compares log-likelihoods)
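
The four stages above can be sketched as a loop over token positions. This is only an illustration under stated assumptions: the three callables (`sample_thought`, `reasoned_prob`, `baseline_prob`) are hypothetical stand-ins for the thought policy, the reasoned predictor, and the EMA no-think baseline, and the toy stubs return hard-coded probabilities instead of model outputs.

```python
import math
from typing import Callable, List, Sequence

def rlp_rewards(
    tokens: Sequence[str],
    sample_thought: Callable[[Sequence[str]], str],
    reasoned_prob: Callable[[Sequence[str], str, str], float],
    baseline_prob: Callable[[Sequence[str], str], float],
) -> List[float]:
    """Compute one information-gain reward per predicted position."""
    rewards = []
    for t in range(1, len(tokens)):
        context = tokens[:t]   # ground-truth history (teacher forcing)
        target = tokens[t]     # true next token x_t
        thought = sample_thought(context)                  # 1. thought policy
        p_reasoned = reasoned_prob(context, thought, target)  # 2. reasoned predictor
        p_baseline = baseline_prob(context, target)           # 3. no-think baseline
        rewards.append(math.log(p_reasoned) - math.log(p_baseline))  # 4. reward
    return rewards

# Toy stubs (hypothetical): the thought only helps on the final answer token.
def toy_thought(context: Sequence[str]) -> str:
    return "10*15=150, 2*15=30, 150+30="

def toy_reasoned_prob(context: Sequence[str], thought: str, target: str) -> float:
    return 0.9 if target == "180" else 0.5

def toy_baseline_prob(context: Sequence[str], target: str) -> float:
    return 0.3 if target == "180" else 0.5

rewards = rlp_rewards(["15", "*", "12", "=", "180"],
                      toy_thought, toy_reasoned_prob, toy_baseline_prob)
print(rewards)
```

In this toy run the reward is zero wherever the thought adds nothing and positive only at the position where it makes the true token more likely.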

System Modules

Thought Policy / Predictor

Generates the Chain-of-Thought and predicts the next token

Model or implementation: qwen3-1.7b-base or Nemotron-Nano-12B-v2

No-Think Baseline

Provides a counterfactual prediction score without reasoning

Model or implementation: EMA Teacher (copy of main model with lagged weights)
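
As a rough sketch of what "lagged weights" could mean in practice, the snippet below applies a standard exponential-moving-average update to a toy parameter dictionary. The decay value of 0.9 is an arbitrary illustration (the summary notes that the paper's EMA decay is not reported), and `ema_update` is a hypothetical helper name.

```python
from typing import Dict

def ema_update(teacher: Dict[str, float],
               student: Dict[str, float],
               decay: float) -> Dict[str, float]:
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy example with a single scalar "weight".
teacher = {"w": 0.0}
student = {"w": 1.0}   # pretend the student has already moved to 1.0

for step in range(3):
    teacher = ema_update(teacher, student, decay=0.9)  # assumed decay, not from the paper
    print(step, teacher["w"])
```

The teacher trails the student, which keeps the no-think baseline from shifting as quickly as the policy being trained.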

Novel Architectural Elements

Integration of a latent 'thought' generation step into the standard pretraining loop for every token (or sampled tokens)
Dual-role usage: the same parameters are used for thought generation and next-token prediction, differentiated only by the presence of the thought in the context

Modeling

Base Model: qwen3-1.7b-base (Transformer) and Nemotron-Nano-12B-v2 (Hybrid Mamba-Transformer)

Training Method: Reinforcement Learning Pre-training (RLP) followed by SFT and RLVR

Objective Functions:

Purpose: Maximize information gain from thoughts.

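Formally: Maximize J(theta) = E[r(c_t)], where r(c_t) = log p_theta(x_t | x_<t, c_t) - log p_phi(x_t | x_<t). Here theta denotes the model parameters and phi the parameters of the EMA no-think baseline.

Purpose: Stabilize updates.

Formally: Clipped surrogate loss L_clip(theta) = - E [ min( rho * A, clip(rho, 1-eps, 1+eps) * A ) ], where rho is the importance ratio between the current policy and the policy that sampled the thought, A is the advantage, and eps is the clipping range.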
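
A minimal sketch of the clipped surrogate, assuming one summed log-probability per sampled thought and a clipping range of eps = 0.2. That value is a common PPO default used here for illustration only, since the summary notes the paper's value is not reported, and `clipped_surrogate_loss` is a hypothetical name.

```python
import math
from typing import Sequence

def clipped_surrogate_loss(logp_new: Sequence[float],
                           logp_old: Sequence[float],
                           advantages: Sequence[float],
                           eps: float = 0.2) -> float:
    """PPO-style loss: -mean(min(rho * A, clip(rho, 1-eps, 1+eps) * A))."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        rho = math.exp(lp_new - lp_old)                  # importance ratio
        rho_clipped = max(1.0 - eps, min(rho, 1.0 + eps))
        terms.append(min(rho * adv, rho_clipped * adv))  # pessimistic bound
    return -sum(terms) / len(terms)

# Two toy samples: ratios 1.5 and 0.5, advantages +1 and -1.
loss = clipped_surrogate_loss(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(loss)
```

Clipping caps the first sample's contribution at 1.2 instead of 1.5, and the second sample is scored at -0.8 rather than -0.5, so a large ratio change in either direction cannot increase the objective.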

Training Data:

General pretraining corpora: Academic papers (ACAD), Math textbooks, Web Crawl QA pairs
SFT-style reasoning corpora: OmniMath, OpenThoughts, Nemotron-Crossthink

Key Hyperparameters:

thought_samples_G: >= 2
EMA_decay: Not explicitly reported in the paper
clip_epsilon: Not explicitly reported in the paper
RLP_training_tokens: 1B tokens (qwen3), 250M tokens (Nemotron)
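
The summary does not say how the advantage A is derived from the G sampled thoughts. One plausible reading, shown here purely as an assumption, is a group-relative baseline: each thought's reward is compared to the mean reward of the group sampled at the same position, which needs at least two samples to be informative. The helper name `group_relative_advantages` is hypothetical.

```python
from typing import List, Sequence

def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Mean-centre rewards within one group of G thoughts (assumed scheme)."""
    if len(rewards) < 2:
        # With a single sample the centred advantage would always be zero.
        raise ValueError("need at least G = 2 thought samples")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Information-gain rewards for G = 4 thoughts at one position (made-up values).
advantages = group_relative_advantages([0.8, 0.2, -0.1, 0.3])
print(advantages)
```

Under this assumption, thoughts that beat the group average are reinforced and those below it are discouraged, regardless of the absolute reward scale.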

Compute: Not reported in the paper

Comparison to Prior Work

vs. RPT (Reinforcement Pre-Training): RLP provides a continuous, dense reward at every position and does not require proxy models or filtering 'easy' tokens.
vs. CPT (continuous pretraining): RLP explicitly models thoughts as latent variables and optimizes for information gain, whereas CPT optimizes only final token likelihood.
vs. STaR [not cited in paper]: STaR filters reasoning chains based on answer correctness (binary), whereas RLP uses continuous information gain on next-token prediction as the signal.

Limitations

Computational overhead of sampling multiple thoughts (G >= 2) per token during training.
Requires teacher forcing for reward computation, which may differ from inference-time generation.
Experiments limited to relatively small models (1.7B) and one mid-sized hybrid model (12B).
Specific hyperparameters (learning rate, EMA decay) are referenced as being in the Appendix but not in the main text.

Reproducibility

The RLP prompt and hyperparameters are said to be in Appendix 10, which is not included in the summarized text. Code is not provided. The OmniMath and OpenThoughts datasets are public.

📊 Experiments & Results

Evaluation Setup

Zero-shot and few-shot evaluation on math and science benchmarks after pretraining, and after subsequent post-training.

Benchmarks:

GSM8K (Math Word Problems)
MATH-500 (Challenging Math Problems)
AIME25 (Olympiad Math)
MMLU (General Knowledge & Reasoning)
GPQA-Diamond (Graduate-Level Science QA)

Metrics:

Pass@1 accuracy
Greedy accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of RLP against baselines (Base and Continuous Pretraining) on qwen3-1.7b-base without post-training.

Benchmark	Metric	Baseline	This Paper	Δ
Overall Average	Average Score	30.32	36.03	+5.71
Overall Average	Average Score	30.85	36.03	+5.18
AIME25	Pass@1	3.96	5.02	+1.06

Performance after identical Post-Training (SFT + RLVR) showing that RLP gains compound.

Benchmark	Metric	Baseline	This Paper	Δ
Overall Average	Average Score	39.90	42.51	+2.61
Science Avg	Average Score	42.73	45.74	+3.01

Scalability results on Nemotron-Nano-12B-v2.

Benchmark	Metric	Baseline	This Paper	Δ
Overall Average	Average Score	42.81	61.32	+18.51
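
The Δ column can be re-derived from the baseline and RLP scores, as in the small check below. The row labels are informal descriptions added for this sketch, and the relative percentages are computed from these rows only, so they need not match headline figures that may have been averaged differently.

```python
# (label, baseline score, RLP score) copied from the Key Results rows.
rows = [
    ("qwen3 overall, first baseline row", 30.32, 36.03),
    ("qwen3 overall, second baseline row", 30.85, 36.03),
    ("qwen3 AIME25", 3.96, 5.02),
    ("post-training overall", 39.90, 42.51),
    ("post-training science avg", 42.73, 45.74),
    ("Nemotron-Nano-12B-v2 overall", 42.81, 61.32),
]

def absolute_and_relative(baseline: float, ours: float):
    """Return (absolute delta in points, relative gain in percent)."""
    delta = ours - baseline
    return round(delta, 2), round(100.0 * delta / baseline, 1)

results = {label: absolute_and_relative(b, o) for label, b, o in rows}
for label, (delta, rel) in results.items():
    print(f"{label}: {delta:+.2f} points, {rel:+.1f}% relative")
```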

Main Takeaways

RLP significantly outperforms both standard base models and continuous pretraining baselines, particularly on reasoning-heavy tasks.
The benefits of RLP are not washed out by post-training (SFT+RLVR); instead, they provide a better foundation that yields higher final performance.
The method is data-efficient, gaining +18.51 average points on Nemotron-Nano-12B-v2 while using a fraction of the training data (0.125%).
RLP works as a dense, verifier-free signal that can be applied to general corpora, not just curated reasoning datasets.

📚 Prerequisite Knowledge

Prerequisites

Language Model Pretraining (Next-Token Prediction)
Reinforcement Learning (Policy Gradients, PPO-style objectives)
Chain-of-Thought (CoT) Prompting
Exponential Moving Average (EMA)

Key Terms

RLP: Reinforcement Learning Pre-training—the proposed method of rewarding thoughts that improve next-token prediction during pretraining

CoT: Chain-of-Thought—intermediate reasoning steps generated by the model to help solve a problem

EMA: Exponential Moving Average—a technique where model weights are updated slowly over time to create a stable reference (teacher) model

NTP: Next-Token Prediction—the standard objective function for training language models

SFT: Supervised Fine-Tuning—training on labeled input-output pairs (e.g., instruction following)

RLVR: Reinforcement Learning with Verifiable Rewards, in which an external checker (like a code compiler or math solver) provides the feedback signal

clipped surrogate: A loss function used in PPO (Proximal Policy Optimization) that prevents the model from changing too much in one update step

information gain: The difference in log-likelihood of the correct token between the reasoning model and the no-thought baseline

teacher forcing: Training technique where the model is fed the actual ground truth tokens as history, rather than its own previous predictions