Evaluation Setup
Language modeling, factual precision benchmarks, and machine unlearning scenarios
Benchmarks:
- Wikipedia Validation Set (language modeling, perplexity)
- TOFU (Machine Unlearning)
- FactScore (Long-form biography generation)
- T-REx (Short-form factual completion)
- PopQA (Long-tail QA)
Metrics:
- Perplexity (Static, Dynamic, Normalized)
- Model Utility (ROUGE, Probability, Truth Ratio)
- Forget Quality (p-value)
- FactScore (%)
- Exact Match (EM)
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
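As a rough illustration of two of the metrics listed above (a minimal sketch, not the paper's evaluation code), perplexity and exact match can be computed as:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

def exact_match(prediction, reference):
    """Exact Match (EM): 1 if the normalized strings are identical, else 0."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return int(norm(prediction) == norm(reference))

# A model assigning probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))  # 2.0
print(exact_match("Paris ", "paris"))   # 1
```

The exact normalization used for EM (casing, whitespace, articles) varies between benchmarks; the version here is a simple common choice, not necessarily the one used by T-REx evaluations.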
Key Results
Factual precision comparisons showing LmLm outperforms standard models of the same size and approaches much larger models:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FactScore | % | 13.5 | 31.4 | +17.9 |
| T-REx | EM | 20.6 | 26.7 | +6.1 |
| PopQA | Accuracy | 14.4 | 42.5 | +28.1 |

Perplexity results showing LmLm is more efficient at modeling text when allowed to look up facts:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Wikipedia Validation | Dynamic Perplexity | Not reported in the paper | Not reported in the paper | Not reported in the paper |
Main Takeaways
- LmLm achieves competitive performance compared to significantly larger LLMs (e.g., 382M LmLm matching 7B LLaMA2 in factual precision).
- Decoupling knowledge enables perfect unlearning by simply deleting database entries, preserving model utility where other methods (e.g., NPO) degrade it.
- Learning to lookup facts is empirically easier for the model than memorizing them, reflected in faster convergence and lower perplexity.
- LmLm preserves general NLU capabilities, performing on par with standard models on tasks like ARC, HellaSwag, and MMLU.
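The decoupled-knowledge takeaway can be sketched as a toy example (all names here — `facts_db`, `answer` — are hypothetical illustrations, not the paper's actual interface):

```python
# Toy sketch of knowledge decoupling: the "model" resolves facts against an
# external database instead of recalling memorized parameters, so deleting a
# database entry unlearns the fact while leaving the model itself untouched.
facts_db = {("Marie Curie", "born"): "1867"}

def answer(entity, relation):
    # Look the fact up rather than generating it from memorized weights.
    return facts_db.get((entity, relation), "[unknown]")

print(answer("Marie Curie", "born"))   # fact found via lookup
del facts_db[("Marie Curie", "born")]  # "perfect unlearning": delete the entry
print(answer("Marie Curie", "born"))   # fact is gone; model utility intact
```

This is only the conceptual shape of the claim: because the fact never lives in the weights, removal is exact, in contrast to optimization-based unlearning methods that must perturb the model itself.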