Instruction-tuned Language Models are Better Knowledge Learners

📝 Paper Summary

Continual Pre-training, Knowledge Internalization

Pre-instruction-tuning (PIT), where models learn QA patterns before or alongside new documents, significantly improves the ability of LLMs to absorb and retrieve factual knowledge compared to standard post-training recipes.

Core Problem

Standard continued pre-training followed by instruction-tuning fails to effectively elicit knowledge from new documents, even when document perplexity is minimized (the 'perplexity curse').

Why it matters:

LLMs need to update their static knowledge base with evolving information without expensive full re-training
Current methods result in models that can recite documents (low perplexity) but cannot answer questions about the facts contained within them
Retrieving knowledge stored in parameters remains significantly less effective than open-book settings, limiting the utility of continual learning

Concrete Example: When a model is trained on a document stating 'Editing was handled by Jennifer Lame', it often fails to answer 'Who handled the editing of Oppenheimer?' despite the document text being perfectly memorized, because the model hasn't learned to link the concept of 'editing' in the document to the specific question format.

Key Novelty

Pre-instruction-tuning (PIT)

Invert the standard order: Expose the LLM to Question-Answer (QA) pairs *before* or *during* the encoding of complex documents
Prioritizing 'how to access knowledge' (via straightforward QA pairs) primes the model to better encode information from complex, cluttered documents during subsequent training
Use a specific curriculum: Train on QA pairs first, then a mix of QA pairs and documents, ensuring the retrieval mechanism is established before the knowledge storage occurs

Architecture

A comparison of different training recipes (1-8), showing the order of data exposure (QA pairs vs Documents) and their combinations

Evaluation Highlights

+17.8 percentage-point accuracy improvement on Llama-2 7B (48.1% vs 30.3%) using PIT++ compared to standard instruction-tuning on the Wiki2023 dataset
+16.3 percentage-point accuracy improvement on Llama-2 70B (62.7% vs 46.4%) using PIT++ compared to standard instruction-tuning
Cross-domain generalization: PIT trained on non-film domains still outperforms standard instruction-tuning when tested on the film domain (38.8% vs 30.3% for 7B)

Breakthrough Assessment

8/10

Identifies a critical failure mode in standard continual learning ('perplexity curse') and provides a strong, counter-intuitive solution (training on instructions before knowledge) with substantial empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Continual knowledge acquisition where an LLM is updated on a new corpus D_new and evaluated on its ability to answer questions Q_new related to D_new

Inputs: New documents (e.g., Wikipedia 2023 articles) and synthetic QA pairs derived from them

Outputs: Answers to factual questions based on the new documents

Pipeline Flow

Data Generation: Generate synthetic QA pairs from raw documents using an existing LLM
Phase 1 (Access Learning): Train on QA pairs only to learn retrieval patterns
Phase 2 (Knowledge Encoding): Train on mixed batches of QA pairs and Raw Documents to internalize facts
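
The two training phases above can be sketched as a simple batch scheduler. This is a minimal illustration, not the authors' implementation: the function name `pit_pp_schedule`, the per-phase epoch counts, the uniform shuffling of QA pairs and documents in Phase 2, and the tiny batch size are all assumptions made for the example.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Example = Tuple[str, str]  # (kind, text), where kind is "qa" or "doc"


def _batches(pool: List[Example], batch_size: int) -> Iterator[List[Example]]:
    """Yield consecutive fixed-size slices of an already shuffled pool."""
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


def pit_pp_schedule(
    qa_pairs: Sequence[str],
    documents: Sequence[str],
    qa_epochs: int = 1,      # hypothetical value, chosen for illustration
    mixed_epochs: int = 1,   # hypothetical value, chosen for illustration
    batch_size: int = 4,     # hypothetical value, chosen for illustration
    seed: int = 0,
) -> Iterator[List[Example]]:
    """Yield QA-only batches first, then batches mixing QA pairs and documents."""
    rng = random.Random(seed)

    # Phase 1 (Access Learning): QA pairs only.
    for _ in range(qa_epochs):
        pool = [("qa", qa) for qa in qa_pairs]
        rng.shuffle(pool)
        yield from _batches(pool, batch_size)

    # Phase 2 (Knowledge Encoding): QA pairs and raw documents shuffled together.
    for _ in range(mixed_epochs):
        pool = [("qa", qa) for qa in qa_pairs] + [("doc", d) for d in documents]
        rng.shuffle(pool)
        yield from _batches(pool, batch_size)
```

Each yielded batch would then be passed to whatever training step is in use; the only point the sketch makes is the ordering constraint, namely that no document is seen before the QA-only phase has finished.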

System Modules

Data Generator

Create diverse questions and answers from raw text to serve as instruction data

Model or implementation: Publicly available LLMs (specific model not named for generation, likely Llama-2 or similar)

Learner Model

Absorb new knowledge and learn to answer questions

Model or implementation: Llama-2 (7B and 70B)

Novel Architectural Elements

Curriculum learning strategy: Ordering QA data before or interleaved with Document data specifically to mitigate the perplexity curse

Modeling

Base Model: Llama-2 (7B and 70B)

Training Method: Continued Pre-training and Instruction Tuning

Objective Functions:

Purpose: Learn from documents.

Formally: L_d = -(1/|d|) * sum_t log P(d_t | d_<t)

Purpose: Learn from QA pairs.

Formally: L_a = -(1/|a|) * sum_t log P(a_t | q, a_<t) (loss computed only on answer tokens)
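
To make the difference between the two objectives concrete, the sketch below computes both from a list of per-token log-probabilities. It assumes the log-probabilities have already been produced by some model; the helper names `document_loss` and `answer_loss` and the 0/1 `answer_mask` representation are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def document_loss(token_log_probs: Sequence[float]) -> float:
    """L_d: negative log-likelihood averaged over every document token."""
    if not token_log_probs:
        raise ValueError("document must contain at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


def answer_loss(token_log_probs: Sequence[float], answer_mask: Sequence[int]) -> float:
    """L_a: negative log-likelihood averaged over answer tokens only.

    token_log_probs covers the full question + answer sequence;
    answer_mask is 1 for answer tokens and 0 for question tokens.
    """
    if len(token_log_probs) != len(answer_mask):
        raise ValueError("log-probs and mask must have the same length")
    selected = [lp for lp, keep in zip(token_log_probs, answer_mask) if keep]
    if not selected:
        raise ValueError("mask selects no answer tokens")
    return -sum(selected) / len(selected)


# Usage: two question tokens (ignored) followed by two answer tokens.
log_probs = [math.log(0.1), math.log(0.1), math.log(0.5), math.log(0.5)]
mask = [0, 0, 1, 1]
qa_loss = answer_loss(log_probs, mask)     # depends only on the last two tokens
doc_loss = document_loss(log_probs)        # averages over all four tokens
```

Because the question tokens are masked out, a poorly predicted question does not contribute to L_a, whereas the same sequence treated as a document would be penalized on every token.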

Training Data:

Wiki2023 dataset (documents from Wikipedia Category:2023)
Wiki2023-film-test: 256 articles
Wiki2023-film-train: 1720 articles
Wiki2023-other-train: diverse domains

Key Hyperparameters:

batch_size: 256 (documents and QA pairs)
learning_rate_documents: 3e-5
learning_rate_qa: 5e-6
epochs: 3 (for PIT stages)
optimizer: AdamW (implied by Llama-2 usage)
scheduler: Cosine decay to 10%
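
One plausible reading of "Cosine decay to 10%" is a cosine schedule whose final learning rate is 10% of the peak. The sketch below implements that reading under stated assumptions: no warmup phase is modeled, the function name `cosine_lr` is hypothetical, and the step counts in the usage lines are made up for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, final_ratio: float = 0.1) -> float:
    """Cosine decay from peak_lr at step 0 to final_ratio * peak_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    min_lr = peak_lr * final_ratio
    return min_lr + (peak_lr - min_lr) * cosine


# Usage with the document learning rate listed above (3e-5) and an
# illustrative 1000-step run.
start_lr = cosine_lr(0, 1000, 3e-5)
end_lr = cosine_lr(1000, 1000, 3e-5)
```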

Comparison to Prior Work

vs. Standard Instruction-Tuning: PIT reverses the order (QA -> Doc or QA+Doc) rather than Doc -> QA, significantly reducing forgetting and improving access
vs. Zhu and Li (2023a): Validates on modern pre-trained LLMs (Llama-2) rather than randomly initialized small transformers; proposes practical 'QA first' curriculum rather than just mixing data

Limitations

Depends on the quality of synthetically generated QA pairs for the new documents
Requires processing the new corpus to generate QA pairs before training, adding computational overhead
Complete avoidance of overlap between pre-training data and Wiki2023 is difficult to guarantee
Experiments focused primarily on Llama-2; generalization to other architectures not explicitly tested

Reproducibility

The paper describes the Wiki2023 construction methodology and prompts in detail. The specific code URL is not provided in the abstract or introduction, but the dataset creation process is reproducible with the provided prompts.

📊 Experiments & Results

Evaluation Setup

Closed-book QA on newly learned documents (Wiki2023-film)

Benchmarks:

Wiki2023-film-test-QA (Factual QA) [New]
Natural Questions (NQ) (General QA, used to measure retention of prior knowledge)

Metrics:

Exact Match (EM) accuracy
Perplexity (PPL)
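
Both metrics are simple to compute once predictions and token log-probabilities are available. The sketch below is illustrative only: the strict string comparison follows the glossary definition of Exact Match, while the optional `normalize` flag (lowercasing, dropping punctuation and English articles) is an assumption borrowed from common QA evaluation practice rather than something this summary confirms the paper uses.

```python
import math
import re
import string
from typing import Sequence


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str, normalize: bool = False) -> bool:
    """Strict string equality by default; optionally compare normalized strings."""
    if normalize:
        return _normalize(prediction) == _normalize(gold)
    return prediction == gold


def em_accuracy(predictions: Sequence[str], golds: Sequence[str], normalize: bool = False) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("need equally sized, non-empty prediction and gold lists")
    hits = sum(exact_match(p, g, normalize) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A perplexity near 1 means the model assigns almost all probability mass to each actual next token, which is the sense in which a document can be "memorized" while EM accuracy on questions about it stays low.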

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on Llama-2 7B showing the impact of different training strategies on absorbing knowledge from Wiki2023-film documents.
Wiki2023-film-test-QA	EM Accuracy	30.3	48.1	+17.8
Wiki2023-film-test-QA	EM Accuracy	27.2	48.1	+20.9
Main comparison on Llama-2 70B showing consistent scaling of the PIT method.
Wiki2023-film-test-QA	EM Accuracy	46.4	62.7	+16.3
Cross-domain experiments showing generalization when training on one domain and testing on another.
Wiki2023-film-test-QA	EM Accuracy	30.3	38.8	+8.5
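
The Δ column reports absolute differences in EM accuracy, i.e. percentage points, which is not the same as a relative improvement. The short sketch below recomputes both quantities from the Baseline and This Paper columns; the helper name `gains` is hypothetical, and the relative figures are derived arithmetic, not numbers reported in the source.

```python
from typing import Tuple


def gains(baseline: float, method: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = method - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# (Baseline, This Paper) pairs copied from the table rows above.
rows = [(30.3, 48.1), (27.2, 48.1), (46.4, 62.7), (30.3, 38.8)]
for baseline, method in rows:
    absolute, relative = gains(baseline, method)
    print(f"{baseline} -> {method}: +{absolute} points, +{relative}% relative")
```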

Experiment Figures

Training dynamics: QA accuracy vs. Document Perplexity (PPL) across epochs and learning rates

Main Takeaways

The 'Perplexity Curse' exists: Minimizing document perplexity does not guarantee the model can answer questions about the document.
PIT (training on QA before/with documents) significantly outperforms standard post-training (training on documents then QA).
Explicitly teaching the model 'how to access' knowledge (via QA pairs) before 'what to encode' (via documents) was the most effective strategy among the training recipes compared.
PIT enhances the ability to absorb knowledge even from documents in different domains than the instruction data.

📚 Prerequisite Knowledge

Prerequisites

Standard LLM training pipeline (Pre-training -> SFT)
Concept of Perplexity (PPL) as a training metric
Catastrophic forgetting in continual learning

Key Terms

Perplexity Curse: The phenomenon where an LLM achieves low perplexity (perfect memorization) on a document but fails to answer questions about the facts contained within it

PIT: Pre-instruction-tuning—a method where the model is tuned on questions *before* or *during* training on the associated raw documents

PIT++: An improved variant of PIT that trains exclusively on QA pairs first, followed by a mix of QA pairs and documents

Wiki2023: A dataset constructed by the authors containing Wikipedia articles from 2023 to ensure minimal overlap with Llama-2's pre-training data

Exact Match (EM): An evaluation metric that checks if the generated answer is character-for-character identical to the ground truth

SFT: Supervised Fine-Tuning (Instruction Tuning)—training the model on input-output pairs to follow instructions

NLL: Negative Log-Likelihood—the standard loss function used for training language models