
General Intelligence Requires Reward-based Pretraining

Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal
arXiv (2025)
Tags: Pretraining · RL · Reasoning · Benchmark · Memory

📝 Paper Summary

Topics: Reasoning in Large Language Models · Pretraining Paradigms (RL vs. Supervised) · Generalization and Transfer Learning
True general intelligence requires replacing supervised next-token pretraining with reward-based pretraining from scratch and architecturally decoupling reasoning from knowledge to avoid overfitting to spurious correlations.
Core Problem
Supervised pretraining on passive data causes LLMs to rely on spurious correlations (memorized patterns) rather than underlying reasoning algorithms, creating a 'local minimum' that post-training RL cannot escape.
Why it matters:
  • Current LLMs (AUI) fail to generalize algorithmic understanding to novel contexts, limiting their reliability and adaptability in real-world settings
  • The dominant 'AlphaGo-style' paradigm (Supervised Pretraining + RL Finetuning) biases exploration, preventing models from discovering generalizable strategies
  • Reliance on massive context windows encourages models to cheat by looking for pattern matches rather than computing solutions
Concrete Example: When prompted to write Python code using 1-based indexing (instead of the standard 0-based), models fail to override their memorized patterns and revert to 0-based indexing. Similarly, models proficient in Python fail to solve simple sorting tasks when presented in the esoteric language Brainf**k.
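To make the indexing example concrete, the requested convention amounts to a one-line index shift; a minimal illustrative sketch (the function name and task are ours, not from the paper):

```python
def element_1based(xs, i):
    """Return the i-th element under 1-based indexing (i = 1 is the first).

    This is the convention such a prompt requests; per the paper, models
    tend to revert to the habitual 0-based access xs[i] instead of
    emitting the shifted xs[i - 1].
    """
    return xs[i - 1]
```

The shift is trivial to state but requires overriding a pattern seen millions of times in training data, which is exactly the failure mode the paper highlights.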
Key Novelty
Shift from AlphaGo (SPT+RFT) to AlphaZero (RPT) Paradigm for LLMs
  • Proposes Reward-based Pretraining (RPT) from scratch as superior to Supervised Pretraining (SPT), arguing that SPT biases models toward memorization
  • Introduces an evaluation benchmark using esoteric programming languages (Brainf**k, Befunge) to strictly isolate reasoning capabilities from memorized syntax
  • Suggests architectural disentanglement where a 'Reasoning Unit' with a small context window interacts with an 'External Memory' to prevent reliance on surface-level token correlations
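The proposed decoupling can be caricatured in a few lines. This is a toy sketch under our own assumptions (class names, the key-value memory, and the running-sum task are all illustrative, not the paper's design): the reasoning unit sees only a small window of tokens at a time, so any state it needs beyond that window must pass through explicit memory reads and writes rather than long-range attention over raw tokens.

```python
class ExternalMemory:
    """Key-value store standing in for the proposed 'External Memory'."""

    def __init__(self):
        self._store = {}

    def write(self, key, value):
        self._store[key] = value

    def read(self, key, default=None):
        return self._store.get(key, default)


class ReasoningUnit:
    """Toy reasoning unit with a deliberately small context window."""

    def __init__(self, memory, window=4):
        self.memory = memory
        self.window = window  # the unit never sees more tokens than this

    def running_sum(self, tokens):
        # Process the stream in window-sized chunks, persisting the
        # partial result in memory instead of re-reading old tokens.
        self.memory.write("acc", 0)
        for i in range(0, len(tokens), self.window):
            chunk = tokens[i:i + self.window]  # all the unit can "see"
            self.memory.write("acc", self.memory.read("acc") + sum(chunk))
        return self.memory.read("acc")
```

Because the window is too small to pattern-match over the whole input, the unit is forced to compute with stored intermediate state, which is the intuition behind preventing reliance on surface-level token correlations.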
Evaluation Highlights
  • Current SOTA LLMs average only ~12% accuracy on Brainf**k tasks and ~29% on Befunge tasks, failing to transfer simple algorithmic logic
  • In controlled Go 9x9 experiments, Reward-based Pretraining (RPT) achieves a 100% win rate against Supervised Pretraining (SPT)
  • RPT outperforms SPT followed by RL Finetuning (SPT+RFT) with a 92% win rate when the latter is constrained by a strict KL penalty (0.5), showing that supervised priors hinder exploration
Breakthrough Assessment
8/10
Strong position paper challenging the dominant scaling/pretraining paradigm. Provides compelling evidence via the 'AlphaZero vs. AlphaGo' analogy and a clever esoteric-language benchmark, though the proposed reasoning/memory architecture remains a theoretical proposal without an implementation.