Synthetic continued pretraining

📝 Paper Summary

Continued Pretraining, Synthetic Data Generation, Knowledge Injection

To teach language models knowledge from small corpora, this paper proposes generating a massive synthetic dataset by prompting a stronger model to describe relations between entities extracted from the source documents.

Core Problem

Pretrained models struggle to learn facts from small, domain-specific corpora (like a single textbook) because acquiring a fact reliably requires training on hundreds to thousands of diverse representations of it, and a small, static corpus states each fact only a few times.

Why it matters:

Current methods require massive internet-scale data for knowledge acquisition, making it difficult to adapt models to private or niche domains with limited text
Simple paraphrasing of small datasets fails to provide the necessary diversity for the model to generalize and retain the information
As public data is exhausted, future improvements will rely on learning from the 'tails' of the data distribution (rare/niche documents)

Concrete Example: Directly training a model on a linear algebra textbook fails because the concepts appear too few times. EntiGraph solves this by extracting entities like 'Vector' and 'Linear space' and generating diverse synthetic 'notes' about their relationships, simulating the variety found in online discussions.

Key Novelty

EntiGraph (Entity-centric Synthetic Augmentation)

Instead of simply paraphrasing sentences, the method constructs a 'knowledge graph' by extracting entities from the source
It then prompts an LLM to generate diverse text descriptions for specific pairs or triplets of entities, explicitly 'filling in' the edges of the graph to create new, diverse contexts for the same facts
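
As a rough illustration of the pair/triplet idea above, the sketch below enumerates entity subsets and builds one prompt per subset. The entity list, the prompt wording, and the helper names entity_subsets and build_relation_prompt are hypothetical. The paper's actual prompt templates are not reproduced here.

```python
from itertools import combinations


def entity_subsets(entities, sizes=(2, 3)):
    """Yield every unordered subset of the entities with the given sizes."""
    for k in sizes:
        yield from combinations(entities, k)


def build_relation_prompt(document, subset):
    """Build an illustrative prompt asking how the entities relate in the document."""
    names = ", ".join(subset)
    return (
        f"Document:\n{document}\n\n"
        "Discuss how the following entities relate to each other "
        f"in the context of the document: {names}"
    )


# Hypothetical entities, loosely following the linear algebra example.
entities = ["Vector", "Linear space", "Basis", "SVD"]
subsets = list(entity_subsets(entities))
prompts = [build_relation_prompt("<source text>", s) for s in subsets]

# 4 entities give C(4,2) + C(4,3) = 6 + 4 = 10 prompts.
print(len(prompts))
```

The number of subsets grows combinatorially with the number of entities, which is how a short source document can support a much larger synthetic corpus.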

Architecture

[Figure: the EntiGraph data augmentation pipeline]

Evaluation Highlights

Synthetic continued pretraining with 455M EntiGraph tokens recovers 80% of the accuracy gain achievable by providing the source documents at inference time (RAG)
Question Answering accuracy scales log-linearly with the number of synthetic tokens, up to 455M
Outperforms standard continued pretraining on original documents and simple paraphrasing baselines

Breakthrough Assessment

8/10

Offers a practical, scalable solution to the 'small corpus' learning problem. The log-linear scaling finding and the approach of externalizing diversity via graph traversal are significant methodological contributions.

⚙️ Technical Details

Problem Definition

Setting: Parametric knowledge acquisition from a small source corpus via continued pretraining

Inputs: Small domain-specific source corpus D_source

Outputs: Language model with knowledge of D_source internalized in its weights

Pipeline Flow

Data Gen Group: Entity Extractor -> Relation Analyzer -> Synthetic Corpus Construction
Training Group: Synthetic Corpus -> CPT Learner
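
The data generation flow above can be sketched as a small offline program. This is a minimal sketch under stated assumptions, not the authors' implementation: the llm argument is any callable from prompt string to reply string, fake_llm is a stand-in that makes no real API call, the prompt wording is invented, and the one-entity-per-line output format is assumed. Only pairs are generated to keep it short.

```python
from itertools import combinations
from typing import Callable, List


def extract_entities(document: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the model for salient entities; assumes one entity per output line."""
    reply = llm(f"List the salient entities in this document, one per line:\n{document}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def analyze_relations(document: str, entities: List[str], llm: Callable[[str], str]) -> List[str]:
    """Generate one passage per entity pair (triplets omitted for brevity)."""
    passages = []
    for a, b in combinations(entities, 2):
        prompt = f"Document:\n{document}\n\nExplain how {a} and {b} are related."
        passages.append(llm(prompt))
    return passages


def build_synthetic_corpus(documents: List[str], llm: Callable[[str], str]) -> List[str]:
    """Run extraction, then relation analysis, and pool the generated passages."""
    corpus = []
    for doc in documents:
        entities = extract_entities(doc, llm)
        corpus.extend(analyze_relations(doc, entities, llm))
    return corpus


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("List the salient entities"):
        return "Vector\nLinear space\nBasis"
    return "Synthetic note for: " + prompt.splitlines()[-1]


corpus = build_synthetic_corpus(["<source text>"], fake_llm)
print(len(corpus))  # 3 entities give 3 pairs, so 3 passages
```

The resulting list of passages is what the Training Group would consume as the synthetic corpus for continued pretraining.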

System Modules

Entity Extractor (Data Gen Group)

Extract salient entities from source documents

Model or implementation: gpt-4-turbo

Relation Analyzer (Data Gen Group)

Generate text descriptions of relations among subsets (pairs/triplets) of extracted entities

Model or implementation: gpt-4-turbo

CPT Learner

Acquire knowledge via next-token prediction on the synthetic corpus

Model or implementation: Llama 3 8B

Novel Architectural Elements

Hierarchical prompting strategy that externalizes diversity generation to combinatorial graph structures (entity pairs/triplets) rather than relying on generic rephrasing prompts

Modeling

Base Model: Llama 3 8B

Training Method: Continued Pretraining (CPT)

Objective Functions:

Purpose: Learn the probability distribution of the synthetic text.

Formally: Next-token prediction (standard Causal Language Modeling loss)
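
Written out, the standard causal language modeling loss over a synthetic token sequence is shown below. The symbols (model parameters \theta, sequence length T, tokens x_t) are notation introduced here for illustration and are not taken from the paper.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```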

Adaptation: Full model update (Continued Pretraining)

Trainable Parameters: All parameters of Llama 3 8B

Training Data:

Source: 265 books from QuALITY dataset (1.3M tokens)
Synthetic Output: 455M tokens generated by EntiGraph using gpt-4-turbo

Compute: Not reported in the paper

Comparison to Prior Work

vs. Rephrase baseline: EntiGraph explicitly models entity relations to enforce diversity, whereas rephrasing yields diminishing returns
vs. RAG: EntiGraph internalizes knowledge into parameters (parametric) rather than requiring retrieval at inference time (non-parametric), though they are complementary
vs. Knowledge Editing: EntiGraph targets learning entire corpora/domains rather than atomic (subject, relation, object) triples

Limitations

Relies on a powerful proprietary model (gpt-4-turbo) for data synthesis, which may be costly
Synthetic data generation can potentially hallucinate relations not present in the source text (though conditioned on source)
Evaluation is primarily on one reading comprehension dataset (QuALITY)

Reproducibility

Prompt templates for 'entity_extraction' and 'relation_analysis' are provided in the paper and Appendix. Code URL is not explicitly provided in the text. The specific subset of QuALITY books used is described.

📊 Experiments & Results

Evaluation Setup

Question Answering on the QuALITY dataset (Reading Comprehension) without access to source documents at test time (Closed-book)

Benchmarks:

QuALITY (Multiple-choice Question Answering)

Metrics:

QA Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
QuALITY (corpus size)	Token count	1.3M (source)	455M (synthetic)	+453.7M
QuALITY	Share of RAG accuracy gain recovered (%)	100 (RAG)	80	-20
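
The arithmetic behind these two rows can be checked with a few lines. The token counts come from the table. The three accuracy values passed to recovered_fraction are hypothetical placeholders chosen only to show how an "80% of the gain" figure is computed, and the function name is an assumption of this sketch.

```python
def recovered_fraction(base_acc, method_acc, reference_acc):
    """Share of the reference method's gain over the base model that the method recovers."""
    return (method_acc - base_acc) / (reference_acc - base_acc)


source_tokens_m = 1.3       # source corpus size in millions of tokens (from the table)
synthetic_tokens_m = 455.0  # synthetic corpus size in millions of tokens (from the table)

delta_m = synthetic_tokens_m - source_tokens_m
expansion = synthetic_tokens_m / source_tokens_m
print(f"delta: +{delta_m:.1f}M tokens, expansion: about {expansion:.0f}x")

# Hypothetical accuracies, NOT results from the paper.
share = recovered_fraction(base_acc=0.40, method_acc=0.56, reference_acc=0.60)
print(f"recovered share of the gain: {share:.0%}")
```

The synthetic corpus is roughly 350 times the size of the source corpus.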

Main Takeaways

Simple paraphrasing saturates quickly; adding more paraphrased tokens yields diminishing returns compared to the log-linear scaling of EntiGraph.
The knowledge acquired via Synthetic CPT is complementary to RAG; combining EntiGraph-trained models with RAG yields better performance than RAG with a base model.
The method effectively converts compute (generation of synthetic data) into data efficiency (learning from small corpora).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Causal Language Modeling (Next-token prediction)
Familiarity with RAG (Retrieval-Augmented Generation)
Concept of Knowledge Graphs (Entities and Relations)

Key Terms

EntiGraph: The proposed data augmentation algorithm that extracts entities and generates text describing their relations to create a synthetic pretraining corpus

Synthetic CPT: Synthetic Continued Pretraining—generating a large synthetic corpus from a small source and then pretraining a model on it

RAG: Retrieval-Augmented Generation—providing the model with relevant source documents as context during inference

Parametric knowledge: Knowledge stored directly in the model's neural network weights, as opposed to knowledge accessed via external retrieval

QuALITY: A reading comprehension dataset used here as the source corpus for evaluation

SVD: Singular Value Decomposition—a mathematical concept used as an example entity in the paper

Log-linear scaling: A relationship in which the performance metric grows linearly in the logarithm of the dataset size, so each tenfold increase in data adds roughly the same amount of accuracy
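
To make the definition concrete, the sketch below fits accuracy = intercept + slope * log10(tokens) by ordinary least squares. The data points are invented for illustration and are not measurements from the paper. The function name fit_log_linear is likewise an assumption.

```python
import math


def fit_log_linear(token_counts, accuracies):
    """Least-squares fit of accuracy = intercept + slope * log10(token_count)."""
    xs = [math.log10(n) for n in token_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical points: accuracy rises by 0.05 for every tenfold increase in tokens.
tokens = [10**6, 10**7, 10**8]
accuracy = [0.40, 0.45, 0.50]

intercept, slope = fit_log_linear(tokens, accuracy)
print(f"gain per tenfold increase in tokens: {slope:.3f}")
```

Under a log-linear trend the slope is the accuracy gained per tenfold increase in data, which is why returns per additional token shrink as the corpus grows.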