
Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training

Valentin Lacombe, Valentin Quesnel, Damien Sileo
arXiv (2026)
Reasoning · Pretraining · RL · Benchmark

📝 Paper Summary

Synthetic Data Generation · Symbolic Reasoning · Pre-training · Reinforcement Learning
Reasoning Core provides a scalable suite of procedurally generated, solver-verified symbolic tasks (such as planning and logic) that improve language model reasoning when mixed into pre-training data.
Core Problem
Existing procedural data generators rely on narrow templates or fixed puzzles (e.g., just BlocksWorld), lacking the distributional breadth needed to instill general reasoning primitives during pre-training.
Why it matters:
  • Training on narrow distributions (e.g., single planning domains) fails to generalize to minor variations
  • Scaling reasoning capabilities requires verifiable data beyond web text, but prolonged RL is compute-intensive
  • Current suites like Reasoning Gym prioritize task count over the distributional generality required for effective pre-training
Concrete Example: Training on a single PDDL domain like BlocksWorld does not generalize to other planning problems. Reasoning Core instead samples randomized PDDL domains covering the full class of STRIPS problems.
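To make this concrete, here is a minimal Python sketch of what sampling a randomized STRIPS-style PDDL domain could look like. It is an illustrative assumption, not the paper's actual generator: the function name `sample_strips_domain`, the predicate/action naming, and the sampling scheme are all hypothetical.

```python
import random

def sample_strips_domain(rng: random.Random,
                         n_predicates: int = 4,
                         n_actions: int = 3,
                         max_arity: int = 2) -> str:
    """Emit a randomized STRIPS-style PDDL domain string.

    Hypothetical illustration: names, arities, and the
    precondition/effect sampling scheme are assumptions,
    not the paper's actual implementation.
    """
    # Sample a random predicate vocabulary: (name, arity) pairs.
    preds = [(f"p{i}", rng.randint(1, max_arity)) for i in range(n_predicates)]
    lines = ["(define (domain random-strips)",
             "  (:requirements :strips)",
             "  (:predicates"]
    for name, arity in preds:
        args = " ".join(f"?x{j}" for j in range(arity))
        lines.append(f"    ({name} {args})")
    lines.append("  )")
    # Sample actions with random preconditions and add/delete effects.
    for a in range(n_actions):
        arity = rng.randint(1, max_arity)
        params = [f"?v{j}" for j in range(arity)]

        def lit() -> str:
            # A random literal over this action's parameters.
            name, p_arity = rng.choice(preds)
            args = " ".join(rng.choice(params) for _ in range(p_arity))
            return f"({name} {args})"

        pre = " ".join(lit() for _ in range(rng.randint(1, 2)))
        lines += [f"  (:action act{a}",
                  f"    :parameters ({' '.join(params)})",
                  f"    :precondition (and {pre})",
                  f"    :effect (and {lit()} (not {lit()}))",
                  "  )"]
    lines.append(")")
    return "\n".join(lines)

print(sample_strips_domain(random.Random(0)))
```

Because the domain itself is sampled rather than fixed, no two training instances need share surface structure, which is exactly the distributional breadth the paper argues fixed puzzles like BlocksWorld lack.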
Key Novelty
High-Generality Procedural Symbolic Suite
  • Generates data for foundational formal domains (planning, logic, equations) using randomized parameters rather than fixed templates to ensure broad distributional coverage
  • Integrates external solvers (theorem provers, planning engines) to provide rigorous verification and reward signals for every generated instance
  • Uses a continuous 'difficulty knob' to scale problem complexity (e.g., proof depth, plan length) for curriculum learning; a sketch of one possible knob mapping follows this list
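As referenced above, a minimal sketch of how a continuous knob might map to instance parameters. The linear scaling and ranges here are assumptions; the paper only states that the knob controls complexity such as proof depth and plan length.

```python
import random
from dataclasses import dataclass

@dataclass
class TaskParams:
    plan_length: int   # planning tasks: target solution length
    proof_depth: int   # logic tasks: depth of the required derivation
    n_objects: int     # shared: size of the symbol universe

def params_from_knob(knob: float, rng: random.Random) -> TaskParams:
    """Map a continuous difficulty knob in [0, 5] to instance parameters.

    Hypothetical mapping: the linear scale and the ranges below are
    assumptions for illustration only.
    """
    scale = 1.0 + knob  # knob 0 -> easiest, knob 5 -> hardest
    return TaskParams(
        plan_length=rng.randint(2, int(2 + 4 * scale)),
        proof_depth=rng.randint(1, int(1 + 2 * scale)),
        n_objects=rng.randint(3, int(3 + 3 * scale)),
    )

rng = random.Random(0)
for knob in (0.0, 2.5, 5.0):
    print(knob, params_from_knob(knob, rng))
```

A single scalar like this makes curriculum schedules trivial to express: ramp the knob over training steps and every task family hardens in lockstep.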
Evaluation Highlights
  • Mixing Reasoning Core data (r=0.5 ratio) into pre-training consistently improves PlatinumBench reasoning performance across three different base corpora (FineWeb, Dolci, SYNTH); see the mixing sketch after this list
  • Symbolic data integration preserves or slightly improves validation loss on general natural language modeling, avoiding the 'tax' often paid for reasoning specialization
  • Zero-shot evaluation confirms tasks remain challenging for frontier models like GPT-5, particularly at higher difficulty settings (knob level 5)
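The mixing itself can be pictured as a simple two-stream sampler. This sketch assumes a document-level Bernoulli draw with probability r, which is one plausible reading of the r=0.5 ratio rather than the paper's reported procedure.

```python
import random
from itertools import islice

def mix_streams(corpus_docs, symbolic_docs, r: float, rng: random.Random):
    """Yield a pre-training stream where roughly a fraction r of documents
    come from the symbolic suite and 1 - r from the base corpus.

    Hypothetical, simplified implementation: stops when either
    stream is exhausted.
    """
    corpus_it, symbolic_it = iter(corpus_docs), iter(symbolic_docs)
    while True:
        source = symbolic_it if rng.random() < r else corpus_it
        try:
            yield next(source)
        except StopIteration:
            return

rng = random.Random(0)
web = (f"web_doc_{i}" for i in range(10))
sym = (f"symbolic_task_{i}" for i in range(10))
print(list(islice(mix_streams(web, sym, r=0.5, rng=rng), 8)))
```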
Breakthrough Assessment
8/10
Strong contribution to synthetic data infrastructure. Moves beyond templated puzzles to solver-verified, high-generality domains essential for scaling reasoning. The demonstration of pre-training gains without language degradation is significant.