Learning Facts at Scale with Active Reading

📝 Paper Summary

Knowledge internalization
Synthetic data generation
Expert domain adaptation

Active Reading improves factual recall in LLMs by having the model self-generate diverse study strategies (paraphrasing, quizzing, linking) for documents rather than just training on raw text.

Core Problem

LLMs struggle to reliably learn and recall facts from their training data, especially for long-tail information that appears sparsely, and standard fine-tuning often leads to overfitting or hallucinations.

Why it matters:

Rote memorization of raw text is inefficient for robust knowledge integration and generalization
Current methods like simple repetition or standard paraphrasing plateau in performance as data scale increases
Integrating new knowledge into models remains brittle, limiting the creation of expert models in domains like finance or medicine

Concrete Example: When finetuning on raw Wikipedia documents, a model might fail to answer 'Who received the IEEE Frank Rosenblatt Award in 2010?' because it only saw the fact once in a specific context. Standard augmentation (paraphrasing) provides limited diversity, leading to recall failures on such tail facts.

Key Novelty

Active Reading (Self-Generated Study Strategies)

Instead of using fixed templates (like just QA pairs), the model prompts itself to invent diverse 'study strategies' (e.g., create a timeline, explain via analogy, map concepts) for a given document
These self-generated strategies are then executed by the model to create diverse synthetic training data that forces it to process the information deeply from multiple angles

Architecture

Conceptual workflow of Active Reading. It contrasts standard training (reading raw text) with Active Reading (generating strategies -> synthesizing data -> learning).

Evaluation Highlights

A gain of about 50 percentage points (16% -> 66%) on SimpleWikiQA over vanilla fine-tuning, also outperforming standard paraphrasing and synthetic QA
Meta WikiExpert-8B (trained on 1T Active Reading tokens) outperforms the much larger Llama 3.1 405B on SimpleQA (23.5% vs 17.1%)
Superior scaling behavior: Unlike synthetic QA, which plateaus, Active Reading performance continues to improve linearly as synthetic data volume scales up to 4B words

Breakthrough Assessment

8/10

Significant jump in factual recall for 8B models, outperforming 400B+ models on specific benchmarks. The method offers a scalable alternative to RAG for knowledge-intensive tasks.

⚙️ Technical Details

Problem Definition

Setting: Closed-book factual Question Answering (QA) after training on a specific corpus of knowledge

Inputs: A corpus of documents D (e.g., Wikipedia articles or financial reports)

Outputs: A model parameter set θ capable of answering factual questions about D without external retrieval

Pipeline Flow

Strategy Generation (Model proposes how to study a chunk)
Data Synthesis (Model executes strategy to create text)
Filter/Mix (Combine with pre-training data)
Training (Standard autoregressive training on synthetic data)
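
The first two stages above can be sketched as a two-step prompting loop. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (propose_strategies, active_read) and the fake_generate stand-in are all assumptions introduced here, and the Filter/Mix and Training stages are left out.

```python
from typing import Callable, List

# Hypothetical prompt templates for illustration only; the paper's actual
# prompts (Appendix 10.3) are not reproduced here.
STRATEGY_PROMPT = (
    "Here is a document:\n{doc}\n\n"
    "Propose {k} different strategies for studying it, one per line."
)
SYNTHESIS_PROMPT = (
    "Here is a document:\n{doc}\n\n"
    "Apply the following study strategy and write out the result.\n"
    "Strategy: {strategy}"
)


def propose_strategies(doc: str, generate: Callable[[str], str], k: int = 3) -> List[str]:
    """Stage 1: ask the generator model how it would study this chunk."""
    raw = generate(STRATEGY_PROMPT.format(doc=doc, k=k))
    # Drop simple bullet markers and empty lines, then keep at most k strategies.
    cleaned = [line.strip().lstrip("-*• ").strip() for line in raw.splitlines()]
    return [s for s in cleaned if s][:k]


def active_read(doc: str, generate: Callable[[str], str], k: int = 3) -> List[str]:
    """Stage 2: execute each proposed strategy to produce one synthetic document."""
    return [
        generate(SYNTHESIS_PROMPT.format(doc=doc, strategy=strategy))
        for strategy in propose_strategies(doc, generate, k)
    ]


def fake_generate(prompt: str) -> str:
    """Offline stand-in for an LLM call so the sketch runs without a model."""
    if prompt.endswith("one per line."):
        return "- make a timeline\n- write a dialogue\n- explain via analogy"
    strategy = prompt.rsplit("Strategy: ", 1)[-1]
    return f"[synthetic text for strategy: {strategy}]"


if __name__ == "__main__":
    chunk = "Example document chunk."
    for text in active_read(chunk, fake_generate, k=2):
        print(text)
```

In practice, generate would wrap a call to the generator model. Passing it in as a callable keeps the sketch independent of any particular inference API.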

System Modules

Strategy Generator (Data Generation)

Propose diverse learning strategies (e.g., 'make a timeline', 'write a dialogue') for a specific document chunk

Model or implementation: Llama 3.1 70B Instruct (used for generation)

Content Synthesizer (Data Generation)

Generate the actual synthetic training text by applying the strategy to the document

Model or implementation: Llama 3.1 70B Instruct (used for generation)

Learner Model

Internalize the knowledge by training on the synthetic data

Model or implementation: Llama 3.1 8B Base

Modeling

Base Model: Llama 3.1 8B Base (and Llama 3.1 70B Base for scaling comparison)

Training Method: Supervised Fine-Tuning / Continued Pre-training

Objective Functions:

Purpose: Standard language modeling.

Formally: Autoregressive next-token prediction loss.
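
For reference, the usual form of this loss for a token sequence can be written as below. The symbols x_t (the t-th token) and T (the sequence length) are notation introduced here, and averaging over tokens is a common convention rather than something the summary specifies; θ is the model parameter set from the problem definition.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```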

Training Data:

Source: Wikipedia (6 million articles) for WikiExpert
Source: SimpleWikiQA documents (subset) for controlled experiments
Augmentation: 1 Trillion tokens generated for WikiExpert using Active Reading
Mixing: 1:1 ratio of Augmented Wikipedia to Pre-training data (DCLM) for final run
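
One simple way to realize a 1:1 mix is to sample each training example from one of the two sources with equal probability. The sketch below assumes the ratio is applied per example; the summary does not say whether the paper mixes by tokens, documents or batches, and mix_streams is a hypothetical helper.

```python
import random
from typing import Iterable, Iterator, Tuple


def mix_streams(
    augmented: Iterable[str],
    pretrain: Iterable[str],
    ratio: Tuple[float, float] = (1.0, 1.0),
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, example) pairs, sampling sources in the given ratio.

    Stops as soon as the sampled source runs out of examples.
    """
    rng = random.Random(seed)
    sources = {"augmented": iter(augmented), "pretrain": iter(pretrain)}
    names = list(sources)
    weights = list(ratio)
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            yield name, next(sources[name])
        except StopIteration:
            return


if __name__ == "__main__":
    aug = [f"augmented-{i}" for i in range(5)]
    pre = [f"pretrain-{i}" for i in range(5)]
    for source, example in mix_streams(aug, pre):
        print(source, example)
```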

Key Hyperparameters:

learning_rate: 3e-4 (for large scale / WikiExpert), 1e-5 (for small scale finetuning)
batch_size: 4,194,304 tokens (large scale), 128 (small scale)
sequence_length: 4096
epochs: 4 (for WikiExpert)
total_steps: 20,000 (small scale), 200,000 (scaling laws experiments)

Compute: Generated 1T tokens of synthetic data. Training involved 8T tokens total (4 epochs over 1T AR data + 1T pretrain data). Specific GPU hours not reported.
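
The token budget can be checked with a few lines of arithmetic. The sketch assumes the large-scale batch size stays constant for the whole run and that the 4096 sequence length applies to that run; the resulting step count is a back-of-envelope estimate, not a figure reported in the summary.

```python
# Figures taken from the hyperparameters and compute note above.
SYNTHETIC_TOKENS = 1_000_000_000_000  # 1T Active Reading tokens
PRETRAIN_TOKENS = 1_000_000_000_000   # 1T pre-training tokens (1:1 mix)
EPOCHS = 4
BATCH_TOKENS = 4_194_304              # large-scale batch size, in tokens
SEQ_LEN = 4096                        # assumed to apply to the large-scale run

# Total tokens seen during training: 4 epochs over the 2T-token mixture.
total_tokens = EPOCHS * (SYNTHETIC_TOKENS + PRETRAIN_TOKENS)

# Number of sequences that make up one batch.
sequences_per_batch = BATCH_TOKENS // SEQ_LEN

# Estimated optimizer steps, assuming a constant batch size (an assumption).
approx_steps = total_tokens / BATCH_TOKENS

if __name__ == "__main__":
    print(f"total tokens:        {total_tokens:,}")
    print(f"sequences per batch: {sequences_per_batch:,}")
    print(f"approximate steps:   {approx_steps:,.0f}")
```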

Comparison to Prior Work

vs. Synthetic QA: Active Reading generates full-text diverse formats (timelines, dialogues) rather than just QA pairs, leading to better scaling and recall
vs. Paraphrasing: Active Reading creates semantically diverse study materials rather than just syntactic variations, delaying performance saturation
vs. EntiGraph: Active Reading encompasses concept mapping but adds broader strategies; authors note EntiGraph underperforms SynthQA in prior work
vs. RAG (Gold Context): Active Reading is a parametric method (weights only) whereas RAG requires external retrieval at inference time

Limitations

Scaling to larger corpora (full Wikipedia) requires aggressive learning rates and mixing pre-training data to prevent catastrophic forgetting
In-context learning (RAG with gold context) still outperforms parametric memory on tasks requiring complex reasoning (FinanceBench)
Generating 1 trillion tokens of synthetic data is computationally expensive compared to standard training on raw tokens
Performance degradation observed when scaling the document set size without careful hyperparameter tuning (learning rate, data mixing)

Reproducibility

Code: https://huggingface.co/datasets/facebook/meta-active-reading

Publicly available: model weights (Meta WikiExpert-8B) and the dataset (Meta Active Reading) are on HuggingFace. A code URL is provided in the abstract. Prompts for strategy generation are in Appendix 10.3.

📊 Experiments & Results

Evaluation Setup

Closed-book Question Answering on factual benchmarks

Benchmarks:

SimpleQA (Adversarial factual QA on tail facts)
SimpleWikiQA (Subset of SimpleQA grounded in Wikipedia documents) [New]
FinanceBench (Financial domain QA, Information Extraction subset)
NaturalQuestions (General factual QA)
TriviaQA (General factual QA)

Metrics:

Recall (Accuracy of correct fact retrieval)
GPT-4o Model Grader score
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Expert Domain Adaptation: Comparing Active Reading against baselines on specific document sets (SimpleWikiQA and FinanceBench).
SimpleWikiQA	Recall/Accuracy	15.92	66.25	+50.33
FinanceBench	Recall/Accuracy	18.43	66.18	+47.75
Pre-training Scale: Evaluating the WikiExpert model (trained on 1T synthetic tokens) against larger base models.
SimpleQA	Accuracy	7.3	23.5	+16.2
NaturalQuestions	Accuracy	29.0	31.2	+2.2
TriviaQA	Accuracy	64.3	68.5	+4.2
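
The Δ column is an absolute difference in percentage points. The short check below recomputes it from the Baseline and This Paper columns and also prints the ratio to baseline; the values are copied from the table and check_deltas is a helper name introduced here.

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table above.
ROWS = [
    ("SimpleWikiQA", 15.92, 66.25, 50.33),
    ("FinanceBench", 18.43, 66.18, 47.75),
    ("SimpleQA", 7.3, 23.5, 16.2),
    ("NaturalQuestions", 29.0, 31.2, 2.2),
    ("TriviaQA", 64.3, 68.5, 4.2),
]


def check_deltas(rows, tol=1e-6):
    """Return benchmarks whose reported delta differs from this_paper - baseline."""
    return [
        name
        for name, baseline, ours, delta in rows
        if abs((ours - baseline) - delta) > tol
    ]


if __name__ == "__main__":
    for name, baseline, ours, _ in ROWS:
        print(f"{name}: +{ours - baseline:.2f} points, {ours / baseline:.1f}x baseline")
    print("mismatches:", check_deltas(ROWS))
```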

Experiment Figures

Scaling trends of factual recall (SimpleWikiQA and FinanceBench) vs. amount of synthetic data generated (up to 4B words).

Diversity analysis (Self-BLEU) of generated data across different methods.
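
To make the diversity metric concrete, here is a simplified Self-BLEU sketch: each generated text is scored with BLEU against all the others, and the scores are averaged, so lower values mean more diverse outputs. The whitespace tokenization, add-one smoothing and 4-gram cutoff are assumptions chosen for brevity; the summary does not state the configuration used in the paper.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU with add-one smoothing (a simplification)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        if total == 0:
            continue  # hypothesis is too short to contain n-grams of this order
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref: Counter = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    if not log_precisions:
        return 0.0
    # Brevity penalty against the reference closest in length to the hypothesis.
    ref_len = min((len(r) for r in references), key=lambda l: (abs(l - len(hypothesis)), l))
    bp = 1.0 if len(hypothesis) >= ref_len else math.exp(1 - ref_len / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / len(log_precisions))


def self_bleu(texts: List[str], max_n: int = 4) -> float:
    """Average BLEU of each text against all other texts in the set."""
    if len(texts) < 2:
        raise ValueError("Self-BLEU needs at least two texts")
    tokenized = [t.lower().split() for t in texts]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    repetitive = ["the cat sat on the mat"] * 3
    varied = [
        "the cat sat on the mat",
        "stock prices rose sharply last quarter",
        "she won the award in 2010",
    ]
    print("repetitive:", self_bleu(repetitive))
    print("varied:    ", self_bleu(varied))
```

For real measurements, an established BLEU implementation with a proper tokenizer would be the safer choice; this version only shows the structure of the computation.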

Main Takeaways

Active Reading enables an 8B model to exceed the factual recall of the much larger Llama 3.1 405B on tail facts (23.5% vs 17.1% on SimpleQA)
Data diversity (measured by Self-BLEU) correlates with better scaling; Active Reading generates more diverse data than paraphrasing or QA, preventing performance saturation
Mixing in original pre-training data is critical when scaling up knowledge injection to prevent catastrophic forgetting and maintain guardrail performance
Simply using a larger model (70B) to generate training data for a smaller model (8B) does not automatically yield better results than the smaller model generating its own study materials

📚 Prerequisite Knowledge

Prerequisites

Language model pre-training and fine-tuning
Synthetic data generation techniques
Knowledge distillation/augmentation

Key Terms

SimpleQA: A benchmark for evaluating the short-form factuality of language models, built from short, fact-seeking questions whose answer is a single specific entity or date

SimpleWikiQA: A subset of SimpleQA created by the authors where questions are grounded in specific Wikipedia documents to test expert domain adaptation

FinanceBench: A question answering benchmark grounded on financial disclosure documents, used to test expert domain knowledge

Self-BLEU: A metric measuring diversity in generated text by calculating the BLEU score of a generated sentence against other generated sentences from the same source; lower scores indicate higher diversity

Guardrail metrics: Standard benchmarks (like NaturalQuestions or TriviaQA) used to ensure a model hasn't lost general capabilities or previous knowledge while learning new specific information

Catastrophic forgetting: The tendency of neural networks to lose previously learned information upon learning new information

Active Reading: The proposed framework where an LLM generates its own strategies (e.g., timelines, analogies) to process a document and create synthetic training data

Mid-training: A training phase between pre-training and fine-tuning, often used to inject domain-specific knowledge or align the model before the final task adaptation

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions its answer on them at inference time