DPO: Direct Preference Optimization—a stable method to fine-tune LMs on preference pairs by optimizing a classification loss, avoiding explicit reward modeling
RLHF: Reinforcement Learning from Human Feedback—training models using rewards derived from human preferences
FactScore: An automated evaluation metric that breaks text into atomic claims and verifies each against Wikipedia using a retrieval system
atomic claims: The smallest indivisible statements of fact within a longer text (e.g., 'Yo-Yo Ma plays cello' is atomic; 'Yo-Yo Ma is a French-born cellist' contains two atomic claims: that he was born in France and that he is a cellist)
calibration: The property that a model's predicted confidence matches its empirical accuracy (e.g., claims made with 80% confidence are correct about 80% of the time)
FactTune-FS: The paper's method using FactScore (reference-based) to generate preference labels
FactTune-MC: The paper's method using Model Confidence (reference-free) to generate preference labels
SFT: Supervised Fine-Tuning—standard training on high-quality demonstration data
ITI: Inference-Time Intervention—a technique that shifts model activations during inference to improve truthfulness
DOLA: Decoding by Contrasting Layers—a decoding strategy that amplifies factual knowledge by contrasting outputs from different model layers
semantic entropy: A measure of uncertainty computed as the entropy over sampled answers after grouping them by meaning rather than by exact token match
Bradley-Terry model: A statistical model predicting the probability that one item is preferred over another based on their underlying reward scores
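
The DPO and Bradley-Terry entries above are closely linked: DPO can be read as a Bradley-Terry classification loss in which each response's reward is implied by how much the fine-tuned policy raises its log-probability relative to a frozen reference model. The following is a minimal sketch of that relationship in plain Python, not the paper's implementation. The function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def bradley_terry_prob(reward_w: float, reward_l: float) -> float:
    """Probability that the item with reward_w is preferred over reward_l.

    Under the Bradley-Terry model this is sigmoid(reward_w - reward_l).
    """
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair (hypothetical scalar inputs).

    logp_*     : log-probability of the preferred (w) / dispreferred (l)
                 response under the policy being trained.
    ref_logp_* : the same quantities under the frozen reference model.
    beta       : scale of the implicit reward (assumed value).
    """
    # Implicit reward of a response: beta * (log pi(y|x) - log pi_ref(y|x)).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has shifted probability toward the preferred response.
    print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

The loss falls as the policy moves probability mass toward the preferred response faster than the reference model does, which is why no separate reward model needs to be trained.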