Evaluation Setup
Models are first pre-pre-trained on synthetic NCA data, then pre-trained on domain corpora, and finally evaluated on validation perplexity or fine-tuned for downstream reasoning tasks.
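The summary does not specify the exact NCA variant used to generate the synthetic pre-pre-training corpus. As a minimal stand-in, assuming cellular-automaton-style dynamics, a 1-D elementary CA (Wolfram rule number) can already produce structured binary token streams of the kind this stage consumes; the paper's actual generator may be richer (e.g. neural or 2-D).

```python
import random

def elementary_ca(rule: int, width: int, steps: int, seed: int = 0):
    """Generate rows of binary tokens from a 1-D elementary cellular automaton.

    `rule` is the Wolfram rule number (0-255); boundaries are periodic.
    This is an illustrative stand-in for an NCA data generator, not the
    paper's actual method.
    """
    rng = random.Random(seed)
    row = [rng.randint(0, 1) for _ in range(width)]
    rows = [row]
    # Rule lookup table: bit i of `rule` gives the next state for
    # neighborhood pattern i (left<<2 | center<<1 | right).
    table = [(rule >> i) & 1 for i in range(8)]
    for _ in range(steps - 1):
        row = [
            table[(row[(i - 1) % width] << 2) | (row[i] << 1) | row[(i + 1) % width]]
            for i in range(width)
        ]
        rows.append(row)
    return rows

# Flatten the rows into one token stream for pre-pre-training.
tokens = [t for r in elementary_ca(rule=110, width=64, steps=32) for t in r]
```

Flattened this way, the CA trajectory becomes an ordinary sequence over a tiny vocabulary, so it can be fed to a standard language-model training loop unchanged.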
Benchmarks:
- OpenWebText (Language Modeling)
- OpenWebMath (Math Language Modeling)
- CodeParrot (Code Language Modeling)
- GSM8K (Math Reasoning)
- HumanEval (Code Generation)
- BigBench-Lite (General Reasoning)
Metrics:
- Validation Perplexity
- Convergence Speed (tokens to reach baseline perplexity)
- Pass@k / Accuracy
- Statistical methodology: results reported across multiple random seeds
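The summary does not spell out how these metrics are computed; the sketch below uses the conventional definitions (perplexity as the exponential of mean token NLL, convergence speedup as a ratio of token budgets, and the standard unbiased pass@k estimator from the HumanEval literature), which may differ in detail from the paper's.

```python
from math import comb, exp

def perplexity(mean_nll: float) -> float:
    """Validation perplexity from mean per-token negative log-likelihood."""
    return exp(mean_nll)

def speedup_factor(nca_curve, scratch_curve, target_ppl: float) -> float:
    """Convergence speedup: tokens the from-scratch run needs to reach
    `target_ppl` divided by tokens the NCA-initialized run needs.
    Each curve is a list of (tokens_seen, val_ppl) checkpoints in order."""
    def tokens_to_reach(curve):
        # First checkpoint at or below the target perplexity.
        return next(t for t, p in curve if p <= target_ppl)
    return tokens_to_reach(scratch_curve) / tokens_to_reach(nca_curve)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct,
    evaluation budget k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, if the from-scratch run reaches the target perplexity after 2B tokens and the NCA-initialized run after 1.25B, the speedup factor is 1.6, matching the units used in the results table.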
Key Results
NCA pre-pre-training consistently improves perplexity compared to training from scratch across different model scales on OpenWebText.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| OpenWebText (Perplexity) | Perplexity Improvement | 0.0 | 8.6 | 8.6 |
| OpenWebText (Perplexity) | Perplexity Improvement | 0.0 | 5.7 | 5.7 |

NCA pre-pre-training is more data-efficient than natural language (C4) pre-pre-training.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| OpenWebText | Perplexity Improvement | 0.0 | 5.0 | 5.0 |

NCA pre-pre-training accelerates convergence across multiple domains.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Various (Web, Math, Code) | Speedup Factor | 1.0 | 1.6 | 0.6 |
Main Takeaways
- NCA pre-pre-training improves downstream performance and convergence speed across web text, math, and code domains.
- Synthetic NCA data is significantly more data-efficient than natural language (C4) for pre-pre-training, outperforming it with 10x less data.
- Optimal NCA complexity (compression ratio) varies by domain: code benefits from simpler dynamics, while math and web text prefer more complex/chaotic rules.
- Attention layers capture the majority of transferable primitives (long-range dependencies), while MLPs are more sensitive to domain alignment.
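The takeaways above use compression ratio as a proxy for rule complexity. Assuming the standard interpretation (more compressible output = simpler, more regular dynamics; less compressible = more chaotic), this proxy can be sketched with a generic compressor such as zlib on the flattened output of different CA rules; the paper's exact complexity measure is not given in this summary.

```python
import random
import zlib

def ca_rows(rule: int, width: int, steps: int, seed: int = 0):
    """1-D elementary cellular automaton with periodic boundaries
    (illustrative stand-in for the paper's NCA generator)."""
    rng = random.Random(seed)
    row = [rng.randint(0, 1) for _ in range(width)]
    out = [row]
    table = [(rule >> i) & 1 for i in range(8)]
    for _ in range(steps - 1):
        row = [
            table[(row[(i - 1) % width] << 2) | (row[i] << 1) | row[(i + 1) % width]]
            for i in range(width)
        ]
        out.append(row)
    return out

def compression_ratio(rule: int, width: int = 64, steps: int = 64) -> float:
    """Raw bytes / zlib-compressed bytes for the flattened CA trajectory.
    Higher ratio = more regular (simpler) dynamics; lower = more chaotic."""
    data = bytes(t for r in ca_rows(rule, width, steps) for t in r)
    return len(data) / len(zlib.compress(data, 9))
```

Under this measure, a trivial rule such as rule 0 (everything dies out) compresses far better than a chaotic rule such as rule 30, giving a simple scalar knob for sweeping NCA complexity per target domain.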