
General Intelligence Requires Reward-based Pretraining

Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal
arXiv (2025)
Tags: Pretraining · RL · Reasoning · Benchmark · Memory

📝 Paper Summary

Topics: Reasoning in Large Language Models · Pretraining Paradigms (RL vs. Supervised) · Generalization and Transfer Learning
True general intelligence requires replacing supervised next-token pretraining with reward-based pretraining from scratch and architecturally decoupling reasoning from knowledge to avoid overfitting to spurious correlations.
Core Problem
Supervised pretraining on passive data causes LLMs to rely on spurious correlations (memorized patterns) rather than underlying reasoning algorithms, creating a 'local minimum' that post-training RL cannot escape.
Why it matters:
  • Current LLMs (AUI) fail to generalize algorithmic understanding to novel contexts, limiting their reliability and adaptability in real-world settings
  • The dominant 'AlphaGo-style' paradigm (Supervised Pretraining + RL Finetuning) biases exploration, preventing models from discovering generalizable strategies
  • Reliance on massive context windows encourages models to cheat by looking for pattern matches rather than computing solutions
Concrete Example: When prompted to write Python code using 1-based indexing (instead of the standard 0-based), models fail to override their memorized patterns and revert to 0-based indexing. Similarly, models proficient in Python fail to solve simple sorting tasks when presented in the esoteric language Brainf**k.
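To make the indexing example concrete, the requested convention amounts to a one-line index shift; a minimal illustrative sketch (the function name and task are ours, not from the paper):

```python
def element_1based(xs, i):
    """Return the i-th element under 1-based indexing (i = 1 is the first).

    This is the convention such a prompt requests; per the paper, models
    tend to revert to the habitual 0-based access xs[i] instead of
    emitting the shifted xs[i - 1].
    """
    return xs[i - 1]
```

The shift is trivial to state but requires overriding a pattern seen millions of times in training data, which is exactly the failure mode the paper highlights.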
Key Novelty
Shift from AlphaGo (SPT+RFT) to AlphaZero (RPT) Paradigm for LLMs
  • Proposes Reward-based Pretraining (RPT) from scratch as superior to Supervised Pretraining (SPT), arguing that SPT biases models toward memorization
  • Introduces an evaluation benchmark using esoteric programming languages (Brainf**k, Befunge) to strictly isolate reasoning capabilities from memorized syntax
  • Suggests architectural disentanglement where a 'Reasoning Unit' with a small context window interacts with an 'External Memory' to prevent reliance on surface-level token correlations
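The proposed decoupling can be caricatured in a few lines. This is a toy sketch under our own assumptions (class names, the key-value memory, and the running-sum task are all illustrative, not the paper's design): the reasoning unit sees only a small window of tokens at a time, so any state it needs beyond that window must pass through explicit memory reads and writes rather than long-range attention over raw tokens.

```python
class ExternalMemory:
    """Key-value store standing in for the proposed 'External Memory'."""

    def __init__(self):
        self._store = {}

    def write(self, key, value):
        self._store[key] = value

    def read(self, key, default=None):
        return self._store.get(key, default)


class ReasoningUnit:
    """Toy reasoning unit with a deliberately small context window."""

    def __init__(self, memory, window=4):
        self.memory = memory
        self.window = window  # the unit never sees more tokens than this

    def running_sum(self, tokens):
        # Process the stream in window-sized chunks, persisting the
        # partial result in memory instead of re-reading old tokens.
        self.memory.write("acc", 0)
        for i in range(0, len(tokens), self.window):
            chunk = tokens[i:i + self.window]  # all the unit can "see"
            self.memory.write("acc", self.memory.read("acc") + sum(chunk))
        return self.memory.read("acc")
```

Because the window is too small to pattern-match over the whole input, the unit is forced to compute with stored intermediate state, which is the intuition behind preventing reliance on surface-level token correlations.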
Evaluation Highlights
  • Current SOTA LLMs average only ~12% accuracy on Brainf**k tasks and ~29% on Befunge tasks, failing to transfer simple algorithmic logic
  • In controlled Go 9x9 experiments, Reward-based Pretraining (RPT) achieves a 100% win rate against Supervised Pretraining (SPT)
  • RPT outperforms SPT followed by RL Finetuning (SPT+RFT) with a 92% win rate when the latter is constrained by a strict KL penalty (0.5), showing that supervised priors hinder exploration
Breakthrough Assessment
8/10
Strong position paper challenging the dominant scaling/pretraining paradigm. Provides compelling evidence via the 'AlphaZero vs. AlphaGo' analogy and a clever esoteric-language benchmark, though the proposed reasoning/memory architecture remains a theoretical proposal without an implementation.