Procedural Pretraining: Initial training phase using data generated by explicit algorithms (e.g., sorting, formal languages) before standard training.
Dyck sequences: Strings of balanced parentheses (e.g., '(()())'), used to teach models nested structure and memory.
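A Dyck string can be sampled with a simple left-to-right procedure: open a parenthesis while pairs remain, close one while the nesting depth is positive. A minimal sketch (the function names are illustrative, not from the source):

```python
import random

def gen_dyck(n_pairs):
    """Sample a balanced-parenthesis (Dyck) string with n_pairs pairs."""
    s, opens_used, depth = [], 0, 0
    while len(s) < 2 * n_pairs:
        # Must open if nothing is pending; must close once all opens are spent.
        if opens_used < n_pairs and (depth == 0 or random.random() < 0.5):
            s.append('('); opens_used += 1; depth += 1
        else:
            s.append(')'); depth -= 1
    return ''.join(s)

def is_balanced(s):
    """Check the defining Dyck property: depth never negative, ends at zero."""
    depth = 0
    for c in s:
        depth += 1 if c == '(' else -1
        if depth < 0:
            return False
    return depth == 0
```

Validating nesting requires a counter (or stack), which is exactly the kind of memory such sequences are meant to exercise.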
Needle-in-a-haystack: A task testing a model's ability to retrieve a specific piece of information ('needle') buried in a long context ('haystack').
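A haystack instance is typically built by planting one distinctive sentence at a chosen depth inside filler text; a sketch under that assumption (the filler and needle here are placeholders, not the benchmark's actual prompts):

```python
import random

def make_haystack(needle, filler_sentences, n_fillers, position):
    """Insert `needle` at fractional depth `position` (0=start, 1=end)
    among n_fillers sentences drawn from filler_sentences."""
    filler = [random.choice(filler_sentences) for _ in range(n_fillers)]
    idx = int(position * len(filler))
    return ' '.join(filler[:idx] + [needle] + filler[idx:])
```

Sweeping `position` over [0, 1] at several context lengths gives the usual retrieval-accuracy heatmap.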
Cellular Automata: Discrete computational systems (like Rule 110) where cells evolve based on local rules, used here to generate complex logical patterns.
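An elementary cellular automaton such as Rule 110 is easy to generate: each cell's next state is read off from the rule number's bits, indexed by the 3-cell neighborhood. A minimal sketch with wrap-around edges:

```python
def ca_step(cells, rule=110):
    """One synchronous update of a 1-D binary cellular automaton."""
    n = len(cells)
    out = []
    for i in range(n):
        # Pack (left, center, right) into a 3-bit index 0..7.
        idx = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        # The rule number's idx-th bit gives the cell's next state.
        out.append((rule >> idx) & 1)
    return out

def ca_run(width=31, steps=8, rule=110):
    """Evolve from a single live cell; returns the list of rows."""
    cells = [0] * width
    cells[width // 2] = 1
    rows = [cells]
    for _ in range(steps):
        cells = ca_step(cells, rule)
        rows.append(cells)
    return rows
```

Concatenated rows of such evolutions form sequences whose next token is a deterministic local function of the previous row, which is why they serve as procedural training data.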
Additive setting: Experiments where procedural data is added to a fixed amount of semantic data to measure performance gains.
Substitutive setting: Experiments where procedural data replaces a portion of semantic data to measure data efficiency.
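The two settings differ only in whether the total token budget grows or stays fixed. A sketch of the mixing logic over token lists (the function and naming are illustrative):

```python
def build_mix(semantic_tokens, procedural_tokens, setting):
    """Compose a pretraining corpus under the two mixing settings.

    additive:     all semantic tokens, plus the procedural tokens on top
                  (budget grows; measures gains from extra procedural data)
    substitutive: total budget fixed at len(semantic_tokens); procedural
                  tokens displace an equal number of semantic tokens
                  (measures data efficiency at constant compute/tokens)
    """
    if setting == "additive":
        return semantic_tokens + procedural_tokens
    if setting == "substitutive":
        keep = len(semantic_tokens) - len(procedural_tokens)
        return semantic_tokens[:keep] + procedural_tokens
    raise ValueError(f"unknown setting: {setting}")
```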
MLP-only transfer: Initializing only the multi-layer perceptron (MLP) weights from the procedurally pretrained model while randomizing the attention weights.
Attention-only transfer: Initializing only the attention weights from the procedurally pretrained model while randomizing the MLP weights.
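Both transfer variants amount to a selective copy over a model's state dict. A sketch assuming a hypothetical 'layer.{i}.{mlp|attn}.*' parameter naming scheme (real frameworks differ):

```python
def selective_transfer(procedural_state, fresh_state, transfer):
    """Build an init state dict that keeps only one submodule's
    procedurally pretrained weights.

    transfer='mlp'  -> procedural MLP weights, fresh (random) attention
    transfer='attn' -> procedural attention weights, fresh (random) MLP
    """
    tag = {'mlp': '.mlp.', 'attn': '.attn.'}[transfer]
    return {k: (procedural_state[k] if tag in k else fresh_state[k])
            for k in fresh_state}
```

Comparing the two variants isolates which component (MLP or attention) carries the benefit of procedural pretraining.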