Evaluation Setup
Inject synthetic facts into the pretraining stream and monitor the log-probability evolution of target spans
Benchmarks:
- Fictional Knowledge Probes (Cloze-style completion) [New]
Metrics:
- Log Probability (of target span)
- Effectivity (immediate learning magnitude)
- Retainability (fraction of learning retained over time)
- Statistical methodology: IQR-based outlier detection (factor 1.5) applied to metric distributions
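The metrics above can be sketched in code. The exact formulas are not spelled out in these notes, so the definitions below (span log-prob as a sum of token log-probs, Effectivity as the immediate log-prob jump after injection, Retainability as the retained fraction of that jump) are illustrative assumptions, not the paper's reference implementation:

```python
def span_log_prob(token_logprobs):
    """Log-probability of a target span = sum of its per-token log-probs."""
    return sum(token_logprobs)

def effectivity(logprob_before, logprob_after_injection):
    """Immediate learning magnitude: log-prob jump right after the fact is seen."""
    return logprob_after_injection - logprob_before

def retainability(logprob_before, logprob_after_injection, logprob_at_t):
    """Fraction of the immediate gain still present t steps after injection."""
    gain = logprob_after_injection - logprob_before
    return (logprob_at_t - logprob_before) / gain if gain else float("nan")

def iqr_filter(values, factor=1.5):
    """Drop outliers outside [Q1 - factor*IQR, Q3 + factor*IQR].

    Quartiles here are crude index-based picks (no interpolation), which is
    enough for a sketch.
    """
    xs = sorted(values)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]
    lo, hi = q1 - factor * (q3 - q1), q3 + factor * (q3 - q1)
    return [v for v in values if lo <= v <= hi]
```

For example, a span whose log-prob moves from -10 to -6 on injection and sits at -8 after t steps has an Effectivity of 4 nats and a Retainability of 0.5.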
Key Results

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Fictional Knowledge | Effectivity (vs. pretraining stage) | Qualitatively similar to late stage | Qualitatively similar to late stage | Insignificant difference |
| Fictional Knowledge | Effectivity (vs. model size) | Lower magnitude | Higher magnitude (OLMo-7B) | Positive |
| Fictional Knowledge | Retainability trend | N/A | Power-law fit | N/A |
| Fictional Knowledge | Retainability (vs. batch size) | Faster forgetting rate | Slower forgetting rate (batch size 2048) | Positive retention |

Notes:
- Pretraining stage does not significantly affect the immediate ability to acquire knowledge (Effectivity), but model size does.
- Forgetting follows a power-law relationship, and larger batch sizes reduce the rate of forgetting.
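The power-law retainability trend can be illustrated with a small fit: a curve r(t) = a * t^(-b) becomes a straight line in log-log space, so ordinary linear regression recovers the exponent. The data points below are hypothetical, not taken from the paper:

```python
import math

# Illustrative retention measurements (fraction of initial gain retained
# after a given number of training steps since injection). Made-up values.
steps = [10, 100, 1000, 10000]
retention = [0.80, 0.50, 0.30, 0.19]

# Linear regression in log-log space: log r = log a - b * log t.
xs = [math.log(t) for t in steps]
ys = [math.log(r) for r in retention]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = -slope                     # forgetting exponent; larger b = faster forgetting
a = math.exp(my - slope * mx)  # intercept back in linear space

print(f"fit: r(t) ~ {a:.2f} * t^(-{b:.2f})")
```

Under this framing, the batch-size result in the table corresponds to a smaller fitted exponent b for the larger batch size.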
Main Takeaways
- Acquisition happens via small probability bumps that are diluted by subsequent updates; knowledge is only 'learned' if the accumulation outpaces the power-law forgetting.
- Data duplication accelerates forgetting of specific instances compared to deduplicated data streams.
- Larger batch sizes improve knowledge retention, suggesting a trade-off between compute efficiency and knowledge stability.
- The 'Long-tail' problem is explained by the fact that rare concepts appear too infrequently to overcome the power-law forgetting dilution.
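The accumulation-vs-dilution mechanism in the takeaways can be sketched with a toy simulation (my assumption of the mechanism, not the paper's code): each exposure adds a fixed log-prob bump that then decays as a power law, and a fact "sticks" only if exposures arrive often enough that accumulation outpaces the decay:

```python
def accumulated_gain(exposure_steps, horizon, bump=1.0, b=0.5):
    """Total surviving log-prob gain at `horizon`, given exposure times.

    Each exposure at step s contributes bump * dt^(-b), where dt is the
    number of steps it has had to decay. bump and b are arbitrary toy values.
    """
    total = 0.0
    for s in exposure_steps:
        if s <= horizon:
            dt = horizon - s + 1
            total += bump * dt ** (-b)
    return total

# A common concept seen every 10 steps vs. a long-tail concept seen twice.
frequent = accumulated_gain(range(0, 1000, 10), horizon=1000)
rare = accumulated_gain([0, 500], horizon=1000)
```

With these toy parameters the frequent concept retains far more accumulated gain than the rare one, which is the long-tail failure mode the last bullet describes.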