Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions

📝 Paper Summary

Continual Pre-training (CPT) Knowledge Injection

Knowledge-Instruct transforms small raw text corpora into information-dense synthetic instruction-response pairs to inject new knowledge into LLMs without catastrophic forgetting or breaking chat templates.

Core Problem

Standard Continual Pre-training (CPT) on small corpora (~100K tokens) fails because LLMs require vast repetition to internalize facts, and unsupervised training on raw text degrades instruction-following capabilities.

Why it matters:

LLMs struggle with niche, domain-specific, or new information not present in their massive pre-training datasets.
Existing CPT methods work well for large datasets (billions of tokens) but suffer from catastrophic forgetting and poor learning efficiency in low-data regimes.
Standard unsupervised CPT breaks the chat template of instruction-tuned models, requiring an additional expensive fine-tuning phase to restore conversational ability.

Concrete Example: A manual or textbook might cover a topic in only ~100K tokens. When standard CPT is applied to this small amount of data, the model fails to learn the facts due to lack of repetition and variations. For instance, in the 'Companies' dataset of 23 fictional companies, standard CPT resulted in near-zero accuracy on factual questions about those companies.

Key Novelty

Knowledge-Instruct

Synthesizes a massive amount of diverse instruction-response pairs from a small document corpus, focusing on entities and facts rather than just raw text prediction.
Uses a multi-step pipeline (extract entities, extract facts, contextualize, deduplicate, paraphrase) to create high-quality synthetic training data that forces the model to learn facts through supervised fine-tuning (SFT).
Injects knowledge directly into instruction-tuned models, avoiding the need for a separate unsupervised pre-training stage that degrades chat capabilities.

Architecture

The six-step data generation pipeline for Knowledge-Instruct.

Evaluation Highlights

Achieves >80% accuracy on the Companies dataset (entirely new knowledge) with Llama-3.1-8B, while standard CPT and Rephrase CPT remain near 0%.
Outperforms standard CPT and Synthetic CPT on PopQA (long-tail knowledge), surpassing even GPT-4o on specific long-tail queries.
Improves multi-hop reasoning on MultiHop-RAG by +24.4 points (Acc) using Llama-3.1-8B compared to standard CPT in oracle settings.

Breakthrough Assessment

8/10

Strong practical contribution for domain adaptation in low-data regimes. Solves the 'chat template breakage' issue of CPT while significantly outperforming standard methods on memorization.

⚙️ Technical Details

Problem Definition

Setting: Continual Pre-training (CPT) on a small corpus D of documents to inject specific knowledge into a pre-trained instruction-tuned model.

Inputs: A small corpus of raw text documents (e.g., ~100K tokens).

Outputs: An instruction-tuned LLM capable of answering factual questions about the corpus D.

Pipeline Flow

Entity Extraction: Identify entities in documents
Factual Extraction: Extract facts associated with entities
Contextualization: Ensure facts are self-contained
Deduplication: Remove redundant facts
Paraphrasing: Generate k variations of each fact
Instruction Conversion: Convert paraphrases into Q&A pairs
Training: Fine-tune the target model on the generated dataset

System Modules

Extraction Model

Extract entities and facts, and generate paraphrases

Model or implementation: GPT-4o-mini

Target Model

Learn the new knowledge via SFT

Model or implementation: Llama-3.1-8B-Instruct or Phi-4-14B

Novel Architectural Elements

Data synthesis pipeline that systematically converts raw text into a 'knowledge graph'-like structure of instruction pairs (Entity -> Fact -> Contextualized Fact -> Paraphrases -> Instructions) to simulate high-repetition pre-training within an SFT framework.

Modeling

Base Model: Llama-3.1-8B-Instruct and Phi-4-14B

Training Method: Supervised Fine-Tuning (SFT) on synthetic data

Training Data:

Input: Raw corpus D
Process: Extract entities/facts, generate k paraphrases per fact
Output: Aggregated dataset D_train of instruction-response pairs

Compute: Not explicitly reported in the paper (implies standard SFT costs)

Comparison to Prior Work

vs. Standard CPT: Uses SFT instead of unsupervised learning; focuses on facts/entities rather than next-token prediction on raw text.
vs. Rephrase CPT: Structures data into Q&A pairs rather than just rewriting documents; targets specific facts.
vs. Synthetic CPT (Yang et al.): More efficient data generation (does not require massive token counts) and skips the intermediate CPT phase, training directly via SFT.
+ 1 more
vs. Q&A generation (general SFT) [not cited in paper]: Unlike generic Q&A generation which targets reasoning or style, Knowledge-Instruct explicitly targets exhaustive factual coverage and repetition for memorization.

Limitations

Relies on an external model (e.g., GPT-4o-mini) for data synthesis, which adds cost and dependency.
Performance on complex reasoning (MultiHop-RAG without oracle) is lower compared to memorization tasks, though still better than baselines.
Does not explicitly address how to handle conflicting knowledge or knowledge updates (updating existing facts vs. learning new ones).

Reproducibility

Code: https://github.com/meniData1/knowledge-instruct

Code for creating the Companies dataset and the PopQA subset is available on GitHub. The extraction prompts are detailed in Appendix D. The exact SFT hyperparameters (learning rate, epochs, batch size) are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Open-ended question answering evaluated by LLM-as-a-Judge (GPT-4o).

Benchmarks:

Companies (New knowledge memorization (fictional entities)) [New]
PopQA (Long-tail knowledge retrieval (Wikipedia subset))
MultiHop-RAG (Multi-hop reasoning over knowledge graph)

Metrics:

Accuracy (judged by GPT-4o, normalized 0-100)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on the Companies dataset (entirely new/fictional knowledge) showing Knowledge-Instruct's ability to teach unseen facts.
Companies	Accuracy	0.1	80.9	+80.8
Companies	Accuracy	4.1	80.9	+76.8
Companies	Accuracy	57.8	80.9	+23.1
Performance on PopQA (long-tail existing knowledge) showing ability to recall obscure facts.
PopQA	Accuracy	26.3	61.0	+34.7
PopQA	Accuracy	53.2	61.0	+7.8
MultiHop-RAG results (Complex reasoning). Note: 'Oracle' refers to providing correct context, testing reasoning over learned knowledge.
MultiHop-RAG (Oracle)	Accuracy	55.9	80.3	+24.4
MultiHop-RAG (Standard)	Accuracy	31.2	46.2	+15.0

Main Takeaways

Standard CPT fails catastrophically in low-data regimes (~100K tokens), often yielding near-zero learning of new facts.
Knowledge-Instruct effectively injects new knowledge (80% accuracy on fictional companies) where unsupervised methods fail.
The method preserves general reasoning and instruction-following abilities better than CPT, which often breaks chat templates.
It significantly enhances the model's ability to utilize retrieved context (Oracle MultiHop-RAG results), suggesting better internalization of the underlying information logic.
Llama-3.1-8B generally benefits more from CPT baselines than Phi-4-14B on PopQA, possibly due to Phi-4's reliance on synthetic data during its own pre-training.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) pre-training vs. fine-tuning
Familiarity with Instruction Tuning (SFT)
Knowledge of catastrophic forgetting in neural networks

Key Terms

CPT: Continual Pre-training—training an already pre-trained model on new data (usually domain-specific) to update its knowledge.

SFT: Supervised Fine-Tuning—training a model on input-output pairs (instructions and responses) to teach it how to follow tasks.

Catastrophic Forgetting: The tendency of a neural network to completely forget previously learned information when trained on new data.

RAG: Retrieval-Augmented Generation—systems that retrieve external documents to help an LLM answer questions.

Multi-hop reasoning: Answering questions that require connecting multiple pieces of information from different sources.

Long-tail knowledge: Facts that appear very infrequently in the training data (e.g., obscure historical events or unpopular entities).

Reversal curse: The phenomenon where an LLM trained on 'A is B' fails to answer 'What is B?' (i.e., 'B is A').

Entity extraction: Identifying specific names, places, or organizations within a text.

Paraphrasing: Rewriting the same fact in multiple different ways to increase data diversity for the model.