DPO: Direct Preference Optimization—a method to align language models by increasing the likelihood of preferred outputs over rejected ones without a separate reward model
NTP: Next-Token Prediction, the standard self-supervised learning objective in which a model predicts the next token in a sequence given the preceding tokens
CT: Continued Training—further training a pre-trained model on specific data using the original pre-training objective
Beam Search: A decoding algorithm that keeps only the k highest-scoring partial sequences (the beam) at each step and extends those, used here to find high-probability wrong answers
Head Knowledge: Frequently occurring facts in the training corpus that the model learns easily
Tail Knowledge: Rarely occurring facts that the model struggles to memorize due to the dominance of head knowledge
Hallucination: When an LLM generates content that is nonsensical or unfaithful to the provided source or real-world facts
SFT: Supervised Fine-Tuning—training on labeled instruction-response pairs
PretrainRL: The proposed framework integrating reinforcement learning into the pre-training phase to consolidate factual knowledge
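
To make the DPO entry above concrete, here is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities under the policy being trained and under a frozen reference model. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, not taken from the source, and this sketch does not claim to show how PretrainRL itself applies the objective.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, standard formulation).

    Each argument is the summed log-probability of a full response
    (preferred = "chosen", dispreferred = "rejected") under either the
    policy being trained or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # beta scales how strongly the policy is pushed away from the reference.
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Illustrative numbers only: the policy already prefers the chosen answer.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that as the policy shifts probability toward the preferred response and rises above it when the policy favors the rejected one.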