PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

📝 Paper Summary

Hallucination mitigation · Continual pre-training

PretrainRL mitigates factual hallucinations by identifying high-probability falsehoods during the pre-training phase and down-weighting them using Direct Preference Optimization (DPO), thereby making room for learning low-probability truths.

Core Problem

Pre-training corpora are often imbalanced: frequent 'head' knowledge dominates, so models learn marginal associations (e.g., 'shoes are red') instead of conditional truths. The resulting 'high-probability falsehoods' block the learning of 'tail' facts.

Why it matters:

Standard Next-Token Prediction (NTP) forces models to fit these biased distributions, embedding hallucinations deeply before fine-tuning even begins
Post-hoc fixes like 'I don't know' alignment or knowledge editing often cause catastrophic forgetting, or merely mask the issue instead of correcting the skewed probability distribution that causes it

Concrete Example: If a corpus mentions 'Brand A shoes are red' 1000x more often than 'Brand B shoes are blue', the model learns a shortcut associating 'shoes' with 'red'. When asked about Brand B, it hallucinates 'red' because the high-probability head knowledge (red) squeezes out the low-probability tail truth (blue).
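
The gap between the marginal shortcut and the conditional truth in this example can be made concrete with a few lines of counting. This is a minimal sketch: the two-entry corpus and the helper names `marginal` and `conditional` are hypothetical, chosen only to mirror the 1000:1 imbalance described above.

```python
from collections import Counter

# Hypothetical corpus mirroring the 1000:1 imbalance in the example above.
corpus = [("Brand A", "red")] * 1000 + [("Brand B", "blue")] * 1


def marginal(color):
    """P(color) ignoring the brand: what a 'shoes -> color' shortcut captures."""
    counts = Counter(c for _, c in corpus)
    return counts[color] / len(corpus)


def conditional(color, brand):
    """P(color | brand): the fact the model should actually learn."""
    colors = [c for b, c in corpus if b == brand]
    return colors.count(color) / len(colors)


print(round(marginal("red"), 3))       # 0.999
print(conditional("blue", "Brand B"))  # 1.0
```

A model that fits the marginal answers 'red' for Brand B almost every time, even though the conditional evidence for 'blue' is unambiguous.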

Key Novelty

Debiasing Then Learning via Pre-training DPO

Applies Direct Preference Optimization (DPO) during the continual pre-training phase to actively reshape the probability distribution
Uses a 'debiasing' strategy: first lowers the probability of popular but incorrect answers (falsehoods) to 'make room' in the model's capacity, then boosts the probability of the correct tail knowledge
Introduces an efficient beam-search-based negative sampling method to automatically discover these 'high-probability falsehoods' without needing access to the original training corpus statistics

Architecture

Comparison of PretrainRL vs. standard methods and the core workflow. (Note: Paper does not have a dedicated architectural block diagram, but Figure 1 conceptualizes the performance gap and Section 3 describes the flow).

Evaluation Highlights

+15.6 percentage-point accuracy gain on POPQA with Qwen3-4B-Base over standard Continued Training (CT)
+13.3 percentage-point accuracy gain on the Wikidata-Knowledge Infusion benchmark with Llama3-8B-Base over the base model
Achieves superior performance on long-tail knowledge datasets without degrading general capabilities on benchmarks like MMLU and GSM8K

Breakthrough Assessment

8/10

Addresses the root cause of hallucination (data imbalance) during pre-training rather than post-hoc. The shift from post-training RL to pre-training RL for knowledge consolidation is methodologically significant.

⚙️ Technical Details

Problem Definition

Setting: Continual pre-training of Large Language Models to improve factual reliability

Inputs: Knowledge triples (subject, predicate, object) transformed into prompts

Outputs: Next-token predictions with reshaped probability distributions favoring factual truths over popular falsehoods

Pipeline Flow

Negative Sampling (Beam search to find high-prob falsehoods) → Dataset Construction (Pairs of Truth vs. Falsehood) → PretrainRL Optimization (Joint DPO + NTP loss)

System Modules

Negative Sampler

Identify 'head falsehoods' (incorrect answers the model assigns high probability to)

Model or implementation: Base LLM (e.g., Qwen3-Base)
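
As an illustration of how a beam search can surface 'head falsehoods', the sketch below runs the search over a hand-written toy distribution instead of a real LLM. Everything here is an assumption made for illustration: `TOY_LM`, `beam_search`, and `head_falsehoods` are hypothetical names, the probabilities are invented, and exact string matching against the truth is a simplification.

```python
import math

# Hypothetical toy "model": next-token distributions keyed by the tokens so far.
TOY_LM = {
    (): {"red": 0.7, "blue": 0.2, "dark": 0.1},
    ("red",): {"<eos>": 1.0},
    ("blue",): {"<eos>": 1.0},
    ("dark",): {"green": 0.6, "red": 0.4},
    ("dark", "green"): {"<eos>": 1.0},
    ("dark", "red"): {"<eos>": 1.0},
}


def beam_search(lm, beam_size, max_len=4):
    """Return finished answers as (tokens, log_prob), most probable first."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        expanded = []
        for tokens, score in beams:
            for tok, p in lm[tokens].items():
                if tok == "<eos>":
                    finished.append((tokens, score + math.log(p)))
                else:
                    expanded.append((tokens + (tok,), score + math.log(p)))
        # Keep only the beam_size most probable unfinished prefixes.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
        if not beams:
            break
    return sorted(finished, key=lambda b: b[1], reverse=True)


def head_falsehoods(candidates, truth):
    """Drop the true answer; what remains are high-probability wrong answers."""
    answers = [(" ".join(tokens), math.exp(score)) for tokens, score in candidates]
    return [(answer, prob) for answer, prob in answers if answer != truth]


falsehoods = head_falsehoods(beam_search(TOY_LM, beam_size=3), truth="blue")
print(falsehoods[0][0])  # red
```

With a real base model the same idea would use the model's own decoding routine; the point is only that the search needs no corpus statistics, just the model's output probabilities.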

Optimizer

Update model weights to down-weight falsehoods and up-weight truths

Model or implementation: Base LLM

Novel Architectural Elements

Integration of preference optimization (DPO) directly into the pre-training/continual pre-training loop for knowledge consolidation, rather than alignment

Modeling

Base Model: Qwen3-4B/7B/14B-Base and Llama3-8B-Base

Training Method: Continual Pre-training with joint DPO and NTP loss

Objective Functions:

Purpose: Reshape probability distribution by preferring truth over specific high-prob falsehoods.

Formally: L_DPO = -log σ(β * (log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))))

Here y_w is the true answer, y_l is a sampled high-probability falsehood, π_ref is the frozen reference model, and σ is the logistic sigmoid.

Purpose: Maintain general language capabilities and anchor distribution to high-quality responses.

Formally: L_NTP = -log π_θ(y_w|x)

Purpose: Combined objective.

Formally: L = L_DPO + λ * L_NTP
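
The three objectives can be evaluated directly from sequence log-probabilities. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, each `logp_*` argument is assumed to be the summed token log-probability of a full answer given the prompt, and `lam` is left as a required argument because the NTP weight is not clearly reported.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """L_DPO for one (truth y_w, falsehood y_l) pair."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def ntp_loss(logp_w):
    """L_NTP: negative log-likelihood of the true answer."""
    return -logp_w


def joint_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, lam, beta=0.1):
    """L = L_DPO + lam * L_NTP (lam is an assumed, user-supplied weight)."""
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) + lam * ntp_loss(logp_w)


# When the policy equals the reference, the DPO margin is 0 and L_DPO = log 2.
print(dpo_loss(-2.0, -1.0, -2.0, -1.0))
```

Raising the truth's log-probability or lowering the falsehood's widens the margin and reduces L_DPO, while the NTP term only rewards the former.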

Training Data:

Negative samples: Generated via beam search on base model (top 20 candidates, sample 5 negatives)
Datasets: POPQA (expanded 3x to 38k), EntityQuestions (expanded 2x to 34k), Wikidata infusion
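
Turning ranked candidates into preference pairs might look like the sketch below. It assumes, beyond what the summary states, that negatives are drawn uniformly at random from the top 20 candidates after removing the true answer by exact string match; `build_preference_pairs` and the record keys `prompt`, `chosen`, and `rejected` are hypothetical names.

```python
import random


def build_preference_pairs(prompt, truth, ranked_candidates, top_k=20, n_neg=5, seed=0):
    """Build (truth, falsehood) pairs from candidates sorted by model probability.

    ranked_candidates: answer strings, most probable first.
    Assumption: uniform sampling without replacement from the top_k pool.
    """
    pool = [c for c in ranked_candidates[:top_k] if c != truth]
    rng = random.Random(seed)
    negatives = rng.sample(pool, min(n_neg, len(pool)))
    return [{"prompt": prompt, "chosen": truth, "rejected": neg} for neg in negatives]


candidates = [f"candidate_{i}" for i in range(30)]
pairs = build_preference_pairs("Brand B shoes are", "candidate_3", candidates)
print(len(pairs))  # 5
```

Each record pairs the same prompt and true answer with a different falsehood, so one fact contributes up to five DPO comparisons.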

Key Hyperparameters:

beta: 0.1
learning_rate: 3e-5 (POPQA/Wikidata), 2e-5 (EntityQuestions)
weight_decay: 0.1
optimizer: AdamW
lambda (NTP weight): 0.2 (unconfirmed: the paper describes the two losses as 'linearly weighted' but does not explicitly state the NTP coefficient)
batch_size: Not reported in the paper
epochs: 1
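
For reference, the reported settings can be collected in a small configuration object. This is only a sketch: `PretrainRLConfig`, `LEARNING_RATES`, and `config_for` are hypothetical names, and values the summary marks as unreported or unconfirmed are left as `None` instead of being guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PretrainRLConfig:
    beta: float = 0.1
    learning_rate: float = 3e-5          # POPQA / Wikidata setting
    weight_decay: float = 0.1
    optimizer: str = "AdamW"
    epochs: int = 1
    ntp_weight: Optional[float] = None   # lambda: not confirmed by the summary
    batch_size: Optional[int] = None     # not reported in the paper


# Per-dataset learning rates as listed above.
LEARNING_RATES = {"POPQA": 3e-5, "Wikidata": 3e-5, "EntityQuestions": 2e-5}


def config_for(dataset):
    """Return the config with the learning rate listed for the given dataset."""
    return PretrainRLConfig(learning_rate=LEARNING_RATES[dataset])


print(config_for("EntityQuestions"))
```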

Compute: Eight H800 GPUs

Comparison to Prior Work

vs. Continued Training (CT): CT only learns truths; PretrainRL explicitly unlearns/down-weights competing high-prob falsehoods.
vs. DPO (Standard): Standard DPO is post-training for alignment; PretrainRL uses it for knowledge injection with specific negative sampling.
vs. Knowledge Editing: PretrainRL targets global probability distribution during training rather than local parameter patches, avoiding scalability issues.
vs. RPO: PretrainRL focuses on the pre-training stage to consolidate knowledge rather than post-training refinement [not cited in paper].

Limitations

Depends on the availability of knowledge triples to construct prompts.
Requires generating negative samples via inference (beam search) which adds computational overhead compared to simple text training.
Primary experiments focused on short-form factoid QA; applicability to long-form reasoning less explored.

Reproducibility

Prompt templates for negative sampling provided in Appendix. Datasets are public (POPQA, EntityQuestions). Hyperparameters for LR and beta are provided. Negative sampling implementation details (beam size 50) provided. Code URL not provided in paper.

📊 Experiments & Results

Evaluation Setup

Evaluated on ability to recall factual knowledge (QA) and general capability benchmarks.

Benchmarks:

POPQA (Long-tail Factoid QA)
EntityQuestions (Factoid QA)
Wikidata-knowledge infusion (Factoid QA)
MMLU (General Knowledge)
GSM8K (Math Reasoning)

Metrics:

Accuracy (Acc)
Hit Ratio (HR@k)
Mean Reciprocal Rank (MRR@k)
Probability (Prob@k)

Statistical methodology: Not explicitly reported in the paper
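
The two ranking metrics listed above have standard definitions, sketched here for a single question. The function names are hypothetical, and the paper's exact matching rules may differ. Prob@k is omitted because the summary does not define it.

```python
def hit_ratio_at_k(ranked, truth, k):
    """HR@k: 1 if the true answer appears among the top-k candidates, else 0."""
    return 1.0 if truth in ranked[:k] else 0.0


def mrr_at_k(ranked, truth, k):
    """MRR@k: reciprocal of the truth's 1-based rank within the top-k, else 0."""
    top = ranked[:k]
    return 1.0 / (top.index(truth) + 1) if truth in top else 0.0


def mean_metric(metric, examples, k):
    """Average a per-question metric over (ranked_candidates, truth) examples."""
    return sum(metric(ranked, truth, k) for ranked, truth in examples) / len(examples)


ranked = ["red", "blue", "green"]  # candidates sorted by model probability
print(hit_ratio_at_k(ranked, "blue", 1))  # 0.0
print(mrr_at_k(ranked, "blue", 3))        # 0.5
```

In this toy ranking the truth 'blue' sits behind the head falsehood 'red', so HR@1 is 0 while MRR@3 still credits the second-place rank.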

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on POPQA showing PretrainRL outperforms baselines across different model sizes.
POPQA	Accuracy	32.0	47.6	+15.6
POPQA	Accuracy	24.6	47.6	+23.0
Wikidata-knowledge infusion	Accuracy	39.4	52.7	+13.3
Large scale experiments on EntityQuestions (1.75M questions) confirm scalability.
EntityQuestions	Accuracy	29.9	47.3	+17.4
Ablation studies validating the components of the loss function.
POPQA	Accuracy	45.0	47.6	+2.6
POPQA	Accuracy	32.0	47.6	+15.6
General capability check to ensure no catastrophic forgetting.
MMLU	Score	65.5	66.5	+1.0

Experiment Figures

Probability distribution of candidate answers for a specific question before and after training.
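
The kind of shift such a figure depicts can be reproduced on a toy problem. The sketch below is not the paper's experiment: it treats the model as a softmax over three candidate answers, applies plain gradient descent on L_DPO + λ * L_NTP, and uses arbitrary toy values for the logits, `lam`, `lr`, and `steps`. For a softmax over candidates, the log-probability difference between two answers reduces to their logit difference, which keeps the gradients simple.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(logits, w, l, beta=0.1, lam=0.2, lr=0.5, steps=200):
    """Gradient descent on L_DPO + lam * L_NTP over candidate logits.

    w: index of the true answer, l: index of the head falsehood.
    The reference model is the initial (frozen) copy of the logits.
    """
    ref = list(logits)
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        u = beta * ((z[w] - z[l]) - (ref[w] - ref[l]))
        g = beta * (1.0 - sigmoid(u))      # dL_DPO/dz_l = +g, dL_DPO/dz_w = -g
        grad = [lam * p_i for p_i in p]    # dL_NTP/dz_i = p_i - 1[i == w]
        grad[w] += -lam - g
        grad[l] += g
        z = [z_i - lr * g_i for z_i, g_i in zip(z, grad)]
    return softmax(z)


candidates = ["red", "blue", "green"]      # 'red' is the head falsehood
before = softmax([2.0, 0.0, 0.0])
after = train([2.0, 0.0, 0.0], w=1, l=0)
for name, b, a in zip(candidates, before, after):
    print(f"{name}: {b:.3f} -> {a:.3f}")
```

In this toy setting the falsehood's probability falls and the truth's rises, which is the qualitative 'debias then learn' behavior the summary describes.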

Main Takeaways

Factual hallucinations stem from pre-training data imbalance where 'head' falsehoods dominate 'tail' truths.
PretrainRL's 'debias then learn' strategy effectively reshapes probability distributions, making room for tail knowledge.
The method generalizes well across model architectures (Qwen, Llama) and sizes (4B, 7B, 14B).
Unlike SFT, which can degrade general capabilities (e.g., math, reasoning), PretrainRL preserves or improves performance on benchmarks like MMLU and GSM8K.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Next-Token Prediction (NTP) loss
Familiarity with Direct Preference Optimization (DPO)
Knowledge of Reinforcement Learning (RL) concepts in LLM training
Concept of long-tail vs. head data distributions

Key Terms

DPO: Direct Preference Optimization—a method to align language models by increasing the likelihood of preferred outputs over rejected ones without a separate reward model

NTP: Next-Token Prediction—the standard self-supervised learning objective where models predict the next word in a sequence

CT: Continued Training—further training a pre-trained model on specific data using the original pre-training objective

Beam Search: A decoding algorithm that keeps only a fixed number of the most probable partial sequences at each step and expands those, used here to find high-probability wrong answers

Head Knowledge: Frequently occurring facts in the training corpus that the model learns easily

Tail Knowledge: Rarely occurring facts that the model struggles to memorize due to the dominance of head knowledge

Hallucination: When an LLM generates content that is nonsensical or unfaithful to the provided source or real-world facts

SFT: Supervised Fine-Tuning—training on labeled instruction-response pairs

PretrainRL: The proposed framework integrating reinforcement learning into the pre-training phase to consolidate factual knowledge