Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning

📝 Paper Summary

Language Model Pretraining Scaling Laws

Unlike the monotonic improvement predicted by standard scaling laws, implicit-reasoning loss follows a U-shaped curve as model size grows: overparameterization hurts performance, and the optimal size is governed by the information-theoretic complexity of the training data.

Core Problem

Standard scaling laws suggest larger models always improve performance, but the effect of scaling on 'implicit reasoning' (deriving new conclusions from pretraining data without explicit chain-of-thought, CoT) is poorly understood.

Why it matters:

Current scaling laws primarily focus on perplexity or memorization, not the ability to reason over world knowledge acquired during pretraining
Over-investing in model size might be detrimental for specific reasoning tasks if the model capacity far exceeds the complexity of the underlying knowledge structure
Understanding reasoning capacity per parameter is crucial for efficient pretraining resource allocation

Concrete Example: If a model learns 'A is father of B' and 'B is father of C', it should implicitly deduce 'A is grandfather of C'. The paper shows that simply increasing model size can cause the model to memorize the individual edges rather than learn the underlying composition rule, leading to worse performance on unseen but valid deductions.
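The deduction in this example can be written down as a tiny rule application. The following is a minimal sketch, not the paper's code: the function name `deduce_grandfathers` and the toy edge list are hypothetical and only illustrate which triples count as valid deductions that never appear verbatim in the training edges.

```python
def deduce_grandfathers(father_edges):
    """Compose father_of with father_of to obtain grandfather_of pairs."""
    # Index children by parent so two-hop paths are easy to follow.
    children = {}
    for parent, child in father_edges:
        children.setdefault(parent, set()).add(child)

    deduced = set()
    for a, middle_nodes in children.items():
        for b in middle_nodes:
            for c in children.get(b, ()):
                deduced.add((a, c))  # A -> B -> C implies grandfather_of(A, C)
    return deduced


# Hypothetical toy data: these edges would be "seen" during pretraining.
father_of = [("A", "B"), ("B", "C"), ("B", "D")]
grandfather_of = deduce_grandfathers(father_of)
print(sorted(grandfather_of))  # [('A', 'C'), ('A', 'D')]
```

A model that only memorizes `father_of` edges assigns no special probability to the pairs in `grandfather_of`; a model that has learned the rule does.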

Key Novelty

U-shaped Reasoning Scaling Law & Graph Search Entropy

Discovers a U-shaped loss curve for reasoning tasks: performance improves with model size up to an optimal point, then degrades due to overfitting/memorization
Proposes 'Graph Search Entropy' to quantify the reasoning complexity of a knowledge graph, defined by the entropy rate of random walks over the graph
Establishes a linear relationship between optimal model size and graph search entropy, finding LMs can reason over ~0.008 bits of information per parameter
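To make the entropy-rate idea concrete, here is a hedged sketch of the textbook quantity for a maximal entropy random walk: on a strongly connected graph its entropy rate equals the logarithm of the largest eigenvalue of the adjacency matrix. This is an assumption-laden simplification; the paper's Graph Search Entropy may include additional terms (for example, for relations), and the function name `merw_entropy_rate_bits` is hypothetical.

```python
import math


def merw_entropy_rate_bits(adj, iters=1000):
    """Entropy rate (bits per step) of a maximal entropy random walk.

    `adj` is a square 0/1 adjacency matrix (list of lists) of a strongly
    connected graph with at least one edge. The rate is log2 of the largest
    eigenvalue, estimated here by power iteration on (A + I); the identity
    shift keeps the iteration stable on periodic graphs such as cycles.
    """
    n = len(adj)
    v = [1.0 / n] * n  # positive start vector that sums to 1
    lam_shifted = 1.0
    for _ in range(iters):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam_shifted = sum(w)  # converges to (lambda_max + 1) because sum(v) == 1
        v = [x / lam_shifted for x in w]
    return math.log2(lam_shifted - 1.0)


# Four nodes, every node linked to every node (self-loops included):
# the walk has 4 equally likely choices per step, i.e. 2 bits per step.
dense = [[1] * 4 for _ in range(4)]
print(round(merw_entropy_rate_bits(dense), 6))
```

A directed cycle, where each step is forced, has an entropy rate of zero under the same function, which matches the intuition that there is nothing to search over.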

Architecture

Illustration of the synthetic knowledge graph generation process, showing how node types and preferential attachment determine connectivity based on predefined logic rules.

Evaluation Highlights

Identifies a U-shaped scaling curve where reasoning loss initially decreases but then increases as models grow beyond an optimal size (e.g., optimal size is ~1M parameters for small synthetic graphs)
Demonstrates a strong linear correlation (R²=0.85) between the optimal model size and the proposed Graph Search Entropy metric across diverse synthetic graph configurations
Predicts the optimal model size for the real-world FB15K-237 dataset accurately using the proposed scaling law derived from synthetic data

Breakthrough Assessment

7/10

Challenges the 'bigger is better' dogma for reasoning capabilities during pretraining and offers a novel information-theoretic metric to predict optimal size. However, findings are primarily based on synthetic/simplified environments.

⚙️ Technical Details

Problem Definition

Setting: Pretraining a language model on a corpus derived from a Knowledge Graph G to minimize next-token prediction loss, then evaluating on link prediction for edges deducible via logic rules.

Inputs: Triples (h, r, t) from a knowledge graph, serialized as sequences of random IDs

Outputs: Probability distribution over possible tail entities t given (h, r)

Pipeline Flow

Graph Generation (Synthetic or Real)
Corpus Construction (Linearization of triples)
Language Model Pretraining (Next-token prediction)
Evaluation (Link prediction on held-out deducible triples)

System Modules

Graph Generator

Generates synthetic knowledge graphs with controlled properties (rules, density, size) using preferential attachment

Model or implementation: Custom Python Algorithm
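As a rough illustration of the preferential-attachment step only, the sketch below grows an untyped graph in which new nodes link to existing nodes with probability proportional to their degree. It is an assumption-based simplification: the paper's generator also handles node types, relations, and logic rules, none of which are modeled here, and the function name and parameters are hypothetical.

```python
import random


def preferential_attachment_edges(num_nodes, edges_per_node, seed=0):
    """Return (new_node, existing_node) edges grown by preferential attachment."""
    rng = random.Random(seed)
    # `pool` holds one entry per unit of degree, so a uniform draw from it
    # picks a node with probability proportional to its degree. The seed
    # nodes get one entry each so they can be chosen at the start.
    pool = list(range(edges_per_node))
    edges = []
    for new in range(edges_per_node, num_nodes):
        targets = set()
        while len(targets) < edges_per_node:
            targets.add(rng.choice(pool))  # distinct, degree-biased targets
        for t in sorted(targets):
            edges.append((new, t))
        pool.extend(sorted(targets))
        pool.extend([new] * edges_per_node)
    return edges


edges = preferential_attachment_edges(num_nodes=50, edges_per_node=2, seed=1)
print(len(edges))  # 96: each of the 48 added nodes contributes 2 edges
```

High-degree nodes accumulate entries in `pool` and therefore attract even more edges, which is what produces the heavy-tailed degree distribution.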

Tokenizer/Serializer

Converts graph triples into text sequences using random IDs to remove lexical cues

Model or implementation: Character-level tokenizer
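The serialization step can be sketched as follows. The ID length, alphabet, and space-separated layout are assumptions chosen for illustration, and `assign_random_ids` and `serialize` are hypothetical helper names; the summary only states that triples become sequences of random IDs.

```python
import random
import string


def assign_random_ids(names, id_len=6, seed=0):
    """Map each name to a unique random ID that carries no lexical meaning."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits
    ids, used = {}, set()
    for name in sorted(set(names)):
        rid = "".join(rng.choice(alphabet) for _ in range(id_len))
        while rid in used:  # redraw on the rare collision
            rid = "".join(rng.choice(alphabet) for _ in range(id_len))
        used.add(rid)
        ids[name] = rid
    return ids


def serialize(triples, ent_ids, rel_ids):
    """Linearize (head, relation, tail) triples into 'head rel tail' lines."""
    return [f"{ent_ids[h]} {rel_ids[r]} {ent_ids[t]}" for h, r, t in triples]


# Hypothetical toy triples.
triples = [("alice", "father_of", "bob"), ("bob", "father_of", "carol")]
ent_ids = assign_random_ids([x for h, _, t in triples for x in (h, t)], seed=1)
rel_ids = assign_random_ids([r for _, r, _ in triples], seed=2)
corpus = serialize(triples, ent_ids, rel_ids)
print(corpus)
```

Because the IDs are random, the model cannot exploit surface cues such as shared surnames and must rely on graph structure alone.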

Language Model

Learns to predict tail entities given head and relation

Model or implementation: Llama architecture (various sizes)

Novel Architectural Elements

Controlled synthetic environment for isolating reasoning scaling: pure structural reasoning without lexical cues (random IDs)
Graph Search Entropy metric integration: relating model capacity directly to the information-theoretic properties of the graph structure rather than just token count

Modeling

Base Model: Llama architecture (custom sizes ranging from tiny to small)

Training Method: Pretraining from scratch via Next-Token Prediction

Objective Functions:

Purpose: Minimize negative log-likelihood of the tail entity given head and relation.

Formally: L(θ) = -Σ log P(e_t | e_h, r; θ)
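The objective above is a plain sum of negative log-probabilities over training triples. A minimal sketch, assuming a stand-in model exposed as a function `prob_fn(h, r)` that returns a dictionary of tail probabilities (a hypothetical interface, not the paper's code):

```python
import math


def tail_nll(examples, prob_fn):
    """Negative log-likelihood of the true tails: -sum log P(t | h, r)."""
    return -sum(math.log(prob_fn(h, r)[t]) for h, r, t in examples)


# Hypothetical stand-in model: uniform over four candidate entities.
ENTITIES = ["e0", "e1", "e2", "e3"]


def uniform_model(h, r):
    return {e: 1.0 / len(ENTITIES) for e in ENTITIES}


examples = [("e0", "r0", "e1"), ("e1", "r0", "e2")]
loss = tail_nll(examples, uniform_model)
print(loss)  # two examples, each costing log(4) nats
```

The loss is in nats because `math.log` is the natural logarithm; dividing by `math.log(2)` converts it to bits.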

Training Data:

Synthetic graphs: generated via preferential attachment and logic rules
Real graph: FB15K-237 processed into triples
Data format: (head_id, relation_id, tail_id) sequences

Key Hyperparameters:

training_steps: Up to 10,000 steps
batch_size: Not reported in the paper
learning_rate: Not reported in the paper
graph_epochs: 30 (for FB15K-237)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Kaplan/Chinchilla: Finds U-shaped scaling (overfitting at large scale) for specific reasoning tasks, contradicting monotonic power laws
vs. Allen-Zhu & Li: Finds significantly lower capacity per parameter for reasoning (0.008 bits per parameter) compared to memorization (2 bits per parameter), highlighting the difficulty gap
vs. Inverse Scaling [cited in paper]: Confirms performance degradation with scale but identifies a specific U-shape and optimal point, rather than monotonic decrease

Limitations

Experiments primarily use synthetic data and small-scale models, not full-scale LLM pretraining
Focuses on implicit reasoning (graph completion) rather than general linguistic reasoning or CoT
The U-shape suggests overfitting to specific graph structures; natural-language corpora, with their redundancy, might behave differently
Graph Search Entropy computation relies on simplifying assumptions about random walks

Reproducibility

Code: https://github.com/WANGXinyiLinda/reasoning-scaling-law

The detailed graph generation algorithm is in the Appendix. Exact training hyperparameters (learning rate, batch size) are missing from the main text.

📊 Experiments & Results

Evaluation Setup

Link prediction on held-out knowledge graph triples

Benchmarks:

Synthetic Knowledge Graphs (implicit multi-hop reasoning via link completion) [New]
FB15K-237 (Knowledge Graph Completion)

Metrics:

Testing Loss (Reasoning Loss)
Accuracy (10-option multiple choice)
Statistical methodology: Not explicitly reported in the paper
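The 10-option multiple-choice accuracy can be sketched as below. How the paper samples the nine distractors is not stated in this summary, so uniform sampling from the other entities is an assumption, and `score_fn` is a hypothetical stand-in for the model's score of a candidate tail.

```python
import random


def multiple_choice_accuracy(queries, score_fn, entities, num_options=10, seed=0):
    """Fraction of queries where the true tail outscores sampled distractors."""
    rng = random.Random(seed)
    correct = 0
    for h, r, t in queries:
        # Assumption: distractors are drawn uniformly from all other entities.
        distractors = rng.sample([e for e in entities if e != t], num_options - 1)
        options = [t] + distractors
        rng.shuffle(options)  # avoid a positional advantage when scores tie
        best = max(options, key=lambda e: score_fn(h, r, e))
        correct += best == t
    return correct / len(queries)


# Hypothetical toy setup: an oracle scorer that knows the held-out triples.
entities = [f"e{i}" for i in range(20)]
held_out = [("e0", "r", "e1"), ("e2", "r", "e3")]
gold = set(held_out)
oracle = lambda h, r, e: 1.0 if (h, r, e) in gold else 0.0
print(multiple_choice_accuracy(held_out, oracle, entities))  # 1.0
```

Under this setup a scorer with no knowledge of the graph lands near 0.1, the chance level for ten options.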

Key Results

Ablation studies on synthetic graphs show how data properties shift the optimal model size.
Benchmark	Metric	Baseline	This Paper	Δ
Synthetic Graphs	Reasoning Capacity per Parameter (bits)	2.0	0.008	-1.992
Synthetic Graphs	R² (Optimal Model Size vs Graph Entropy)	0.0	0.85	+0.85
FB15K-237	Optimal Model Size Prediction	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Plots of Training Loss and Testing Loss vs Model Size for FB15K-237 across three data formats (Natural Language, Template, Random IDs).

Six subplots (a-f) showing Testing Loss vs Model Size while varying graph hyperparameters (Training Steps, # Triples, # Rules, # Relations, Deducible Ratio, # Entities).

Scatter plot of Optimal Model Size vs Graph Search Entropy with a linear regression line.

Main Takeaways

Reasoning performance follows a U-shaped curve with model size: initially improving, then degrading due to overfitting/memorization.
Optimal model size is stable with respect to training steps (after sufficient training) but scales linearly with Graph Search Entropy.
More training triples, more relations, and higher graph connectivity increase the optimal model size.
The number of logic rules affects performance but does NOT significantly impact optimal model size.
Reasoning is 'expensive': LMs can only reason over ~0.008 bits of graph information per parameter, vs 2 bits for memorization.
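Two of these takeaways translate into simple arithmetic: reading the optimal size off a U-shaped loss curve, and converting an entropy budget into a parameter count. The sketch below assumes the linear relation passes through the origin with slope 1/0.008 parameters per bit, which is a simplification (the paper's fitted line may include an intercept). The loss values and the entropy figure are hypothetical placeholders, not results from the paper.

```python
BITS_PER_PARAM = 0.008  # reasoning capacity per parameter quoted in the summary


def optimal_size_from_curve(sizes, test_losses):
    """Return the model size at the bottom of a (U-shaped) test-loss curve."""
    best = min(range(len(sizes)), key=lambda k: test_losses[k])
    return sizes[best]


def predicted_optimal_params(entropy_bits, bits_per_param=BITS_PER_PARAM):
    """Assumed through-the-origin linear law: params = entropy / capacity."""
    return entropy_bits / bits_per_param


# Hypothetical U-shaped curve: loss falls, bottoms out, then rises again.
sizes = [100_000, 300_000, 1_000_000, 3_000_000, 10_000_000]
losses = [2.1, 1.6, 1.3, 1.5, 1.9]
print(optimal_size_from_curve(sizes, losses))  # 1000000

# Hypothetical graph with 8,000 bits of search entropy: about 1e6 parameters.
print(predicted_optimal_params(8_000))
```

Under the same assumed law, the 2 bits per parameter reported for memorization would need roughly 250 times fewer parameters for the same number of bits.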

📚 Prerequisite Knowledge

Prerequisites

Understanding of Knowledge Graphs (entities, relations, triples)
Language Model Pretraining (next-token prediction)
Scaling Laws (Kaplan et al., Chinchilla)
Information Theory (Entropy rate, random walks)

Key Terms

implicit reasoning: The ability of a language model to draw new conclusions (deduce missing edges) from existing knowledge seen during pretraining without explicit chain-of-thought training

graph search entropy: A metric quantifying the complexity of a knowledge graph, calculated as the entropy rate of a maximal entropy random walk over the graph

deducible triples: Triples in a knowledge graph that can be inferred from other triples using a set of logical rules (e.g., transitivity)

atomic triples: Triples in a knowledge graph that cannot be inferred from other triples and must be memorized

optimal model size: The specific model parameter count that achieves the minimum testing loss for a given dataset, derived from the bottom of the U-shaped loss curve

preferential attachment: A graph generation process where new nodes prefer to attach to existing nodes with high degrees, creating scale-free networks

FB15K-237: A standard benchmark dataset for knowledge graph completion, derived from Freebase

broken neural scaling law: A deviation from power-law scaling where performance changes non-monotonically with scale (often double-descent or U-shaped)

inverse scaling: A phenomenon where larger models perform worse than smaller models on specific tasks

transitive rule: A logic rule where A->B and B->C implies A->C (e.g., ancestor relationships)