DPO: Direct Preference Optimization, an alignment method that optimizes a policy directly on preference pairs, without training a separate reward model or running a reinforcement-learning loop
Mask-DPO: The proposed method, which masks the DPO objective at the sentence level so that training ignores the incorrect parts of preferred answers and the correct parts of rejected answers
FactScore: A metric that decomposes a generation into atomic facts and reports the percentage of those facts supported by a knowledge source (e.g., Wikipedia)
ANAH-v2: A fine-grained hallucination annotation model and dataset used to label sentence-level factuality
RLHF: Reinforcement Learning from Human Feedback, a general framework for aligning models using rewards derived from human preferences
hallucination: Generated content that is nonsensical or unfaithful to the source or to world knowledge
FactTune: A baseline method that uses DPO for factuality alignment but relies on response-level factuality scores rather than masked sentence-level optimization
policy model: The language model being trained to generate responses
reference model: The original version of the model before alignment, used to keep the trained model from drifting too far from its starting point (via a KL penalty)
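
To make the DPO, Mask-DPO, policy model, and reference model entries concrete, the following is a minimal sketch in plain Python of a masked DPO loss for a single preference pair. It is an illustration under stated assumptions, not the authors' implementation: it assumes per-token log-probabilities from the policy and reference models are already available as lists, and that each token carries a 0/1 mask derived from sentence-level factuality labels (1 for tokens in correct sentences of the preferred answer, 1 for tokens in incorrect sentences of the rejected answer). The function names, the default beta of 0.1, and the toy numbers are hypothetical.

```python
import math


def masked_logratio(policy_logps, ref_logps, mask):
    """Sum log(pi_policy / pi_ref) over the tokens whose mask is 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def mask_dpo_loss(chosen_policy, chosen_ref, chosen_mask,
                  rejected_policy, rejected_ref, rejected_mask,
                  beta=0.1):
    """DPO loss for one pair, with token masks applied to both answers.

    With all-ones masks this reduces to the ordinary DPO loss.
    beta is an assumed hyperparameter that scales the implicit KL penalty.
    """
    margin = beta * (
        masked_logratio(chosen_policy, chosen_ref, chosen_mask)
        - masked_logratio(rejected_policy, rejected_ref, rejected_mask)
    )
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy example (made-up numbers): the second token of the preferred answer
# belongs to an incorrect sentence, so it is masked out; the second token
# of the rejected answer belongs to a correct sentence, so it is masked out.
loss = mask_dpo_loss(
    chosen_policy=[-1.0, -2.0, -0.5], chosen_ref=[-1.5, -1.0, -0.5],
    chosen_mask=[1, 0, 1],
    rejected_policy=[-2.0, -1.0], rejected_ref=[-1.0, -1.0],
    rejected_mask=[1, 0],
)
print(loss)
```

Masked tokens contribute nothing to the margin, so in this sketch the policy is neither rewarded for the incorrect sentence in the preferred answer nor penalized for the correct sentence in the rejected one.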