Training Language Models via Neural Cellular Automata

📝 Paper Summary

Synthetic Data for LLM Pre-training; Pre-pre-training strategies

Pre-pre-training language models on synthetic data generated by Neural Cellular Automata (NCA) instills transferable computational priors that improve downstream performance and convergence on natural language tasks.

Core Problem

Natural language data is finite, biased, and entangles reasoning with knowledge, making it difficult to isolate and train pure reasoning capabilities efficiently.

Why it matters:

High-quality natural text data may be exhausted by around 2028, according to projections from current scaling trends
Natural language corpora require expensive curation and contain undesirable human biases
Current synthetic data approaches (random strings, simple grammars) often fail to match the performance of natural language training under matched budgets

Concrete Example: Training on random strings or simple formal languages (like Dyck-k) often yields poor transfer to real language tasks because these distributions lack the rich, long-range spatiotemporal structures found in natural text.

Key Novelty

NCA Pre-pre-training

Use Neural Cellular Automata (NCA) to generate synthetic, non-linguistic token sequences that exhibit rich, controllable spatiotemporal patterns and Zipfian statistics
Train models on this synthetic data first (pre-pre-training) to learn general computational primitives like rule inference and long-range dependency tracking before seeing any natural language
Filter NCA rules by gzip compression ratio to tune data complexity for specific downstream domains (e.g., lower complexity for code, higher for math)
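The complexity filter in the last point can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the ratio is defined as compressed bytes divided by raw bytes (so lower means more compressible), that tokens are serialized as 2-byte integers, and the function names and thresholds are hypothetical.

```python
import gzip
import random

def gzip_ratio(tokens, bytes_per_token=2):
    """Compressed size / raw size of a token sequence (lower = more compressible)."""
    # Assumption: each token id is serialized as a fixed-width little-endian integer.
    raw = b"".join(t.to_bytes(bytes_per_token, "little") for t in tokens)
    return len(gzip.compress(raw)) / len(raw)

def filter_by_ratio(sequences, low, high):
    """Keep only sequences whose gzip ratio lies in [low, high]."""
    return [s for s in sequences if low <= gzip_ratio(s) <= high]

rng = random.Random(0)
constant = [7] * 4096                                  # trivially predictable
periodic = [i % 16 for i in range(4096)]               # short repeating cycle
noisy = [rng.randrange(10_000) for _ in range(4096)]   # close to incompressible

ratios = {
    "constant": gzip_ratio(constant),
    "periodic": gzip_ratio(periodic),
    "noisy": gzip_ratio(noisy),
}
# Hypothetical band that keeps only hard-to-compress sequences.
kept = filter_by_ratio([constant, periodic, noisy], low=0.5, high=2.0)
```

Under this definition, a band of ratios selects a complexity regime: a low band keeps simple, repetitive dynamics, and a high band keeps near-chaotic ones.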

Architecture

Visualization of NCA rollouts with varying complexity levels, alongside their gzip compression ratios.

Evaluation Highlights

Improves downstream perplexity on OpenWebText by up to 5.7% (1.6B model) compared to training from scratch
Accelerates convergence by up to 1.6× on web text, math, and code datasets compared to scratch baselines
Outperforms pre-pre-training on natural language (C4) by 5% in perplexity even when the C4 baseline uses roughly 10× more data (1.6B vs. 164M tokens)

Breakthrough Assessment

8/10

Demonstrates that non-linguistic synthetic data can outperform natural language for acquiring core computational priors, offering a path around data-scarcity limits. The roughly 10× data efficiency relative to C4 is particularly significant.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive next-token prediction on tokenized 2D grid trajectories

Inputs: Sequence of tokens representing flattened 2D grid states of an NCA rollout

Outputs: Predicted next token in the sequence

Pipeline Flow

NCA Data Generation (Sample rule θ → Rollout → Tokenize)
Pre-pre-training (Train Transformer on NCA tokens)
Pre-training (Train on Natural Language)
Fine-tuning (Task-specific adaptation)

System Modules

NCA Generator

Generate synthetic training data with controllable complexity

Model or implementation: 3x3 Convolution + MLP (hidden size 16)
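A rough sketch of such a generator is shown below in dependency-free Python. The grid size (12x12), number of states (10), and hidden width (16) come from this summary. Everything else is an assumption made for illustration: one-hot state encoding, wrap-around boundaries, Gaussian random weights, ReLU activation, and a deterministic argmax update.

```python
import random

GRID, STATES, HIDDEN = 12, 10, 16  # values listed in this summary

def sample_rule(rng):
    """Draw random weights: a 3x3 perception layer followed by a linear head."""
    in_dim = 9 * STATES  # one-hot states of the 3x3 neighbourhood
    w1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(HIDDEN)]
    w2 = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(STATES)]
    return w1, w2

def step(grid, rule):
    """Apply the same local rule to every cell (assumed toroidal wrap-around)."""
    w1, w2 = rule
    new = [[0] * GRID for _ in range(GRID)]
    for r in range(GRID):
        for c in range(GRID):
            # Indices of the active one-hot inputs over the 3x3 neighbourhood.
            active, k = [], 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    s = grid[(r + dr) % GRID][(c + dc) % GRID]
                    active.append(k * STATES + s)
                    k += 1
            hidden = [max(0.0, sum(row[i] for i in active)) for row in w1]  # ReLU
            logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
            new[r][c] = max(range(STATES), key=logits.__getitem__)  # argmax state
    return new

def rollout(rule, steps, rng):
    """Random initial grid, then `steps` rule applications; returns all frames."""
    grid = [[rng.randrange(STATES) for _ in range(GRID)] for _ in range(GRID)]
    frames = [grid]
    for _ in range(steps):
        grid = step(grid, rule)
        frames.append(grid)
    return frames

rng = random.Random(0)
rule = sample_rule(rng)
frames = rollout(rule, steps=8, rng=rng)
```

Each sampled rule defines a different dynamical system, so sampling many rules yields a family of trajectories with varying complexity.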

Tokenizer

Convert 2D grid updates into 1D token sequences

Model or implementation: Patch-based tokenization (2x2 patches)
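One way to realize patch-based tokenization is sketched below. With 10 cell states and 2x2 patches there are 10^4 = 10,000 possible patches, which matches the vocabulary size listed under the hyperparameters. The base-10 positional encoding and the raster scan order are assumptions for illustration, not details confirmed by the paper.

```python
STATES, PATCH = 10, 2

def tokenize_frame(grid):
    """Map each non-overlapping 2x2 patch to one integer token (raster order)."""
    tokens = []
    for r in range(0, len(grid), PATCH):
        for c in range(0, len(grid[0]), PATCH):
            tok = 0
            for dr in range(PATCH):
                for dc in range(PATCH):
                    # Treat the four cell states as digits of a base-10 number.
                    tok = tok * STATES + grid[r + dr][c + dc]
            tokens.append(tok)
    return tokens

def tokenize_rollout(frames):
    """Concatenate the tokens of successive frames into one 1D sequence."""
    return [t for frame in frames for t in tokenize_frame(frame)]

# Toy 12x12 grid, used only to exercise the function.
grid = [[(r + c) % STATES for c in range(12)] for r in range(12)]
tokens = tokenize_frame(grid)
sequence = tokenize_rollout([grid, grid])
```

Under these assumptions a 12x12 frame becomes 36 tokens, so a 1024-token context holds about 28 consecutive frames.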

Transformer Model

Learn latent dynamics and computational priors

Model or implementation: Llama-based Transformer (1.6B params)

Novel Architectural Elements

Use of gzip-filtered NCA trajectories as a pre-pre-training substrate to systematically control data complexity

Modeling

Base Model: Llama-based transformer (1.6B parameters, 24 layers, 32 heads, 2048 hidden dim)

Training Method: Standard autoregressive cross-entropy training

Objective Functions:

Purpose: Maximize probability of correct next token.

Formally: Minimize the cross-entropy loss $\mathcal{L} = -\sum_{t} \log P(x_t \mid x_{<t})$
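As a small worked example of this objective, the sketch below computes the summed negative log-likelihood from per-step logits using a numerically stable log-softmax. The logits and targets are toy values chosen for illustration.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sequence_nll(logits_per_step, targets):
    """L = -sum_t log P(x_t | x_<t), given the model's logits at each step."""
    return -sum(log_softmax(l)[y] for l, y in zip(logits_per_step, targets))

# Two prediction steps over a 4-token vocabulary.
logits = [[0.0, 0.0, 0.0, 0.0],   # uniform: contributes log(4)
          [2.0, 0.0, 0.0, 0.0]]   # favours token 0, which is the target
targets = [1, 0]
loss = sequence_nll(logits, targets)
```

In practice the sum is usually averaged over tokens, which rescales the loss without changing its minimizer.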

Training Data:

164M synthetic NCA tokens for pre-pre-training
9B tokens OpenWebText for pre-training
4B tokens OpenWebMath for pre-training
13B tokens CodeParrot for pre-training

Key Hyperparameters:

nca_grid_size: 12x12
nca_states: 10
patch_size: 2x2
compression_threshold: >50%
vocab_size: 10000
sequence_length: 1024

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. Dyck-k: NCA provides richer spatiotemporal structure and Zipfian statistics closer to natural language
vs. Random/CFG: NCA offers explicit control over complexity via compression ratios
vs. C4 (Natural Language): NCA outperforms C4 pre-pre-training even with significantly less data (164M vs 1.6B tokens)

Limitations

Relative gains decrease as model scale increases (8.6% at 400M vs 5.7% at 1.6B)
Requires careful tuning of NCA complexity (compression ratio) to match downstream domains
Currently only explored as a pre-pre-training step, not full pre-training replacement
Embeddings must be re-initialized when switching from NCA to natural language
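The last limitation can be illustrated with a toy weight-transfer step. This is a sketch under stated assumptions: the parameter names (`token_embedding`, `lm_head`), the choice to also re-initialize the output head, and the 0.02 initialization scale are all hypothetical, and plain dictionaries of lists stand in for a real checkpoint.

```python
import random

def transfer_weights(nca_state, new_vocab, dim, rng, std=0.02):
    """Carry over non-embedding weights; draw fresh vocabulary-sized matrices."""
    new_state = {}
    for name, tensor in nca_state.items():
        if name in ("token_embedding", "lm_head"):  # hypothetical parameter names
            # The NCA vocabulary has no correspondence to text tokens, so start fresh.
            new_state[name] = [[rng.gauss(0, std) for _ in range(dim)]
                               for _ in range(new_vocab)]
        else:
            new_state[name] = tensor  # attention / MLP weights are reused as-is
    return new_state

# Toy checkpoint: 10-token NCA vocabulary, model width 4.
nca_state = {
    "token_embedding": [[0.0] * 4 for _ in range(10)],
    "layer0.attn": [[1.0] * 4 for _ in range(4)],
    "lm_head": [[0.0] * 4 for _ in range(10)],
}
new_state = transfer_weights(nca_state, new_vocab=50, dim=4, rng=random.Random(0))
```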

Reproducibility

Code: https://github.com/danihyunlee/nca-pre-pretraining

Hyperparameters for pre-training are detailed in Appendix B. NCA generation details (grid size, architecture) are fully specified.

📊 Experiments & Results

Evaluation Setup

Pre-pre-train on NCA data, then pre-train on domain corpora, then evaluate perplexity or fine-tune for reasoning tasks.

Benchmarks:

OpenWebText (Language Modeling)
OpenWebMath (Math Language Modeling)
CodeParrot (Code Language Modeling)
GSM8K (Math Reasoning)
HumanEval (Code Generation)
BigBench-Lite (General Reasoning)

Metrics:

Validation Perplexity
Convergence Speed (tokens to reach baseline perplexity)
Pass@k / Accuracy

Statistical methodology: Results are reported across multiple random seeds
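The convergence-speed metric (tokens needed to reach the baseline's perplexity) can be computed from two training curves roughly as follows. The curves below are made-up numbers for illustration only, not results from the paper, and the function names are hypothetical.

```python
def tokens_to_reach(curve, target_ppl):
    """First token count at which validation perplexity is <= target, else None."""
    for tokens, ppl in curve:
        if ppl <= target_ppl:
            return tokens
    return None

def speedup(baseline_curve, candidate_curve):
    """Baseline tokens / candidate tokens needed to hit the baseline's final perplexity."""
    target = baseline_curve[-1][1]
    base_tokens = tokens_to_reach(baseline_curve, target)
    cand_tokens = tokens_to_reach(candidate_curve, target)
    return None if cand_tokens is None else base_tokens / cand_tokens

# Illustrative (tokens seen, validation perplexity) checkpoints; not real data.
scratch = [(1e9, 40.0), (2e9, 30.0), (4e9, 25.0), (8e9, 22.0)]
warm_start = [(1e9, 35.0), (2e9, 27.0), (4e9, 22.0), (8e9, 21.0)]

factor = speedup(scratch, warm_start)
```

Because curves are only sampled at checkpoints, the resulting factor is quantized to the checkpoint spacing; interpolating between checkpoints would give a finer estimate.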

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
NCA pre-pre-training consistently improves perplexity compared to training from scratch across different model scales on OpenWebText.
OpenWebText (400M model)	Perplexity improvement (%)	0.0	8.6	+8.6
OpenWebText (1.6B model)	Perplexity improvement (%)	0.0	5.7	+5.7
NCA pre-pre-training is more data-efficient than natural language (C4) pre-pre-training.
OpenWebText (vs. C4 pre-pre-training)	Perplexity improvement (%)	0.0	5.0	+5.0
NCA pre-pre-training accelerates convergence across multiple domains.
Various (web, math, code)	Speedup factor (×, up to)	1.0	1.6	+0.6

Experiment Figures

Training curves (Validation Perplexity vs. Tokens) for OpenWebText, OpenWebMath, and CodeParrot.

Comparison of NCA pre-pre-training vs. C4 pre-pre-training on OpenWebText perplexity.

Main Takeaways

NCA pre-pre-training improves downstream performance and convergence speed across web text, math, and code domains.
Synthetic NCA data is significantly more data-efficient than natural language (C4) for pre-pre-training, outperforming it with roughly one-tenth the data (164M vs. 1.6B tokens).
Optimal NCA complexity (compression ratio) varies by domain: code benefits from simpler dynamics, while math and web text prefer more complex/chaotic rules.
Attention layers capture the majority of transferable primitives (long-range dependencies), while MLPs are more sensitive to domain alignment.

📚 Prerequisite Knowledge

Prerequisites

Neural Cellular Automata (NCA) dynamics
Transformer architecture and autoregressive training
Kolmogorov complexity / compression ratios
Pre-training vs. Fine-tuning paradigms

Key Terms

NCA: Neural Cellular Automata, a system in which a grid of cells updates its states via a neural network applied locally to each cell's neighborhood

Pre-pre-training: An initial training phase on synthetic data before the standard pre-training on natural language corpora

Zipfian distribution: A statistical distribution where the frequency of any word is inversely proportional to its rank; common in natural language
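A simple way to check whether a token stream is approximately Zipfian is to fit a line to log-frequency versus log-rank; a slope near -1 indicates the inverse-rank relationship described above. The sketch below uses an ordinary least-squares fit on a synthetic corpus constructed for illustration, and the function name is hypothetical.

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic corpus in which the k-th most common token appears about 1200/k times.
corpus = [k for k in range(1, 51) for _ in range(1200 // k)]
exponent = zipf_exponent(corpus)
```

An exponent close to 1 corresponds to the classic Zipf law; real corpora typically deviate at the head and tail of the ranking.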

gzip compression ratio: A metric used here as a proxy for structural complexity; data that compresses more strongly has simpler, more predictable patterns

Dyck-k: A formal language of balanced parentheses with k distinct bracket types, often used to test recursive reasoning
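For concreteness, the following sketch checks membership in Dyck-k with a stack and samples random balanced strings. The bracket alphabet (k = 2) and the 50% open/close probability are arbitrary choices for illustration.

```python
import random

PAIRS = {"(": ")", "[": "]"}  # k = 2 bracket types

def is_dyck(s, pairs):
    """True if every bracket in s is closed by the matching type in the right order."""
    closers = set(pairs.values())
    stack = []
    for ch in s:
        if ch in pairs:
            stack.append(pairs[ch])       # remember which closer is expected
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False
        else:
            return False                   # symbol outside the alphabet
    return not stack

def sample_dyck(n_pairs, pairs, rng):
    """Random balanced string containing exactly n_pairs bracket pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        if opened < n_pairs and (not stack or rng.random() < 0.5):
            opener = rng.choice(list(pairs))
            out.append(opener)
            stack.append(pairs[opener])
            opened += 1
        else:
            out.append(stack.pop())        # close the most recent open bracket
    return "".join(out)

sample = sample_dyck(6, PAIRS, random.Random(0))
```

Recognizing such strings requires tracking nesting depth and bracket type, which is why Dyck-k is a common probe for hierarchical structure.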

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better prediction
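Perplexity is the exponential of the mean negative log-probability assigned to the observed tokens. The sketch below computes it and also a relative improvement in percent. Treating the reported "perplexity improvement" as a relative reduction against the baseline is an assumption here, and the inputs are toy values.

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-probability) over the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def relative_improvement(baseline_ppl, new_ppl):
    """Percentage reduction in perplexity relative to the baseline."""
    return 100.0 * (baseline_ppl - new_ppl) / baseline_ppl

# A model that guesses uniformly over a 10,000-token vocabulary.
uniform_ppl = perplexity([1 / 10_000] * 1024)

# Toy comparison: perplexity 20.0 for a baseline and 19.0 for a candidate.
gain = relative_improvement(20.0, 19.0)
```

The uniform case shows the usual interpretation: perplexity equals the effective number of equally likely choices the model is hedging between.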