hallucination: A phenomenon where LLMs generate seemingly convincing but factually erroneous responses
preference learning: Fine-tuning models using pairs of preferred (better) and dispreferred (worse) outputs to steer behavior
under-alignment: A failure mode where the tuning process is too superficial, causing no significant behavior change in out-of-domain settings
over-alignment: A failure mode where the model learns spurious features (e.g., style) rather than the intended task, leading to poor generalization
atomic preferences: Preference pairs constructed at the granularity of individual facts/sentences rather than entire paragraphs
FActScore: An automated metric that breaks a generation into atomic facts, verifies each against a knowledge base, and reports the fraction of facts that are supported
DPO: Direct Preference Optimization, a stable preference-learning method that optimizes a classification loss directly on preference pairs, without a separate reward model
shifted tokens: Tokens whose probability rank changes significantly after fine-tuning compared to the base model, used as a proxy for behavioral change
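
As a rough illustration of the DPO entry above, the per-pair loss can be sketched in plain Python. The helper name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the source. The inputs are assumed to be summed log-probabilities of the preferred and dispreferred responses under the model being tuned (the policy) and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (hypothetical helper; beta=0.1 is an assumed default)."""
    # Log-ratio of policy to reference for each response (the implicit reward, up to beta).
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Scaled margin between the preferred and dispreferred response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Binary classification loss: -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy raises the preferred response relative to the dispreferred one, which is the sense in which no separate reward model is needed.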