Procedural Pretraining: Initial training phase using data generated by explicit algorithms (e.g., sorting, formal languages) before standard training.
Dyck sequences: Strings of balanced parentheses (e.g., '(()())'), used to teach models nested structure and memory.
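A Dyck string can be sampled with a simple left-to-right procedure: open a parenthesis while pairs remain, close one while the nesting depth is positive. A minimal sketch (the function names are illustrative, not from the source):

```python
import random

def gen_dyck(n_pairs):
    """Sample a balanced-parenthesis (Dyck) string with n_pairs pairs."""
    s, opens_used, depth = [], 0, 0
    while len(s) < 2 * n_pairs:
        # Must open if nothing is pending; must close once all opens are spent.
        if opens_used < n_pairs and (depth == 0 or random.random() < 0.5):
            s.append('('); opens_used += 1; depth += 1
        else:
            s.append(')'); depth -= 1
    return ''.join(s)

def is_balanced(s):
    """Check the defining Dyck property: depth never negative, ends at zero."""
    depth = 0
    for c in s:
        depth += 1 if c == '(' else -1
        if depth < 0:
            return False
    return depth == 0
```

Validating nesting requires a counter (or stack), which is exactly the kind of memory such sequences are meant to exercise.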
Needle-in-a-haystack: A task testing a model's ability to retrieve a specific piece of information ('needle') buried in a long context ('haystack').
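A haystack instance is typically built by planting one distinctive sentence at a chosen depth inside filler text; a sketch under that assumption (the filler and needle here are placeholders, not the benchmark's actual prompts):

```python
import random

def make_haystack(needle, filler_sentences, n_fillers, position):
    """Insert `needle` at fractional depth `position` (0=start, 1=end)
    among n_fillers sentences drawn from filler_sentences."""
    filler = [random.choice(filler_sentences) for _ in range(n_fillers)]
    idx = int(position * len(filler))
    return ' '.join(filler[:idx] + [needle] + filler[idx:])
```

Sweeping `position` over [0, 1] at several context lengths gives the usual retrieval-accuracy heatmap.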
Cellular Automata: Discrete computational systems (like Rule 110) where cells evolve based on local rules, used here to generate complex logical patterns.
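An elementary cellular automaton such as Rule 110 is easy to generate: each cell's next state is read off from the rule number's bits, indexed by the 3-cell neighborhood. A minimal sketch with wrap-around edges:

```python
def ca_step(cells, rule=110):
    """One synchronous update of a 1-D binary cellular automaton."""
    n = len(cells)
    out = []
    for i in range(n):
        # Pack (left, center, right) into a 3-bit index 0..7.
        idx = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        # The rule number's idx-th bit gives the cell's next state.
        out.append((rule >> idx) & 1)
    return out

def ca_run(width=31, steps=8, rule=110):
    """Evolve from a single live cell; returns the list of rows."""
    cells = [0] * width
    cells[width // 2] = 1
    rows = [cells]
    for _ in range(steps):
        cells = ca_step(cells, rule)
        rows.append(cells)
    return rows
```

Concatenated rows of such evolutions form sequences whose next token is a deterministic local function of the previous row, which is why they serve as procedural training data.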
Additive setting: Experiments where procedural data is added to a fixed amount of semantic data to measure performance gains.
Substitutive setting: Experiments where procedural data replaces a portion of semantic data to measure data efficiency.
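The two settings differ only in whether the total token budget grows or stays fixed. A sketch of the mixing logic over token lists (the function and naming are illustrative):

```python
def build_mix(semantic_tokens, procedural_tokens, setting):
    """Compose a pretraining corpus under the two mixing settings.

    additive:     all semantic tokens, plus the procedural tokens on top
                  (budget grows; measures gains from extra procedural data)
    substitutive: total budget fixed at len(semantic_tokens); procedural
                  tokens displace an equal number of semantic tokens
                  (measures data efficiency at constant compute/tokens)
    """
    if setting == "additive":
        return semantic_tokens + procedural_tokens
    if setting == "substitutive":
        keep = len(semantic_tokens) - len(procedural_tokens)
        return semantic_tokens[:keep] + procedural_tokens
    raise ValueError(f"unknown setting: {setting}")
```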
MLP-only transfer: Initializing only the multi-layer perceptron (MLP) weights from the procedurally pretrained model while randomizing the attention weights.
Attention-only transfer: Initializing only the attention weights from the procedurally pretrained model while randomizing the MLP weights.
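Both transfer variants amount to a selective copy over a model's state dict. A sketch assuming a hypothetical 'layer.{i}.{mlp|attn}.*' parameter naming scheme (real frameworks differ):

```python
def selective_transfer(procedural_state, fresh_state, transfer):
    """Build an init state dict that keeps only one submodule's
    procedurally pretrained weights.

    transfer='mlp'  -> procedural MLP weights, fresh (random) attention
    transfer='attn' -> procedural attention weights, fresh (random) MLP
    """
    tag = {'mlp': '.mlp.', 'attn': '.attn.'}[transfer]
    return {k: (procedural_state[k] if tag in k else fresh_state[k])
            for k in fresh_state}
```

Comparing the two variants isolates which component (MLP or attention) carries the benefit of procedural pretraining.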