Factuality: The consistency of LLM-generated content with established facts (commonsense, world knowledge, domain facts).
Hallucination: The generation of content that is nonsensical or unfaithful to the provided sources; distinct from factuality in that it also covers details that are factually correct but unsupported by, or irrelevant to, the source.
Snowballing: An inference-level error where an LLM commits to an initial incorrect claim and then generates further consistent but incorrect details to support it.
Retrieval-Augmented Generation (RAG): A method to enhance LLMs by retrieving relevant documents from external sources to ground the generation.
SFT: Supervised Fine-Tuning—training the model on labeled instruction-following data.
MMLU: Measuring Massive Multitask Language Understanding—a benchmark evaluating models on tasks covering STEM, the humanities, and social sciences.
TruthfulQA: A benchmark designed to measure whether language models give truthful answers to questions that tend to elicit common misconceptions and false beliefs.
USMLE: United States Medical Licensing Examination—a set of standardized tests used to assess medical competency.
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps.
RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences.
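
To make the RAG entry above concrete, here is a minimal sketch of the retrieve-then-ground pattern. It is illustrative only: the function names (`score`, `retrieve`, `build_prompt`), the word-overlap scoring, and the prompt layout are all assumptions made for this example, not a method taken from the source. Practical systems typically use a learned retriever or a search index, and pass the resulting prompt to an actual LLM.

```python
import re
from collections import Counter


def score(query: str, doc: str) -> int:
    """Toy relevance score: number of word tokens shared by query and doc."""
    q = Counter(re.findall(r"\w+", query.lower()))
    d = Counter(re.findall(r"\w+", doc.lower()))
    return sum((q & d).values())


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_prompt(query: str, corpus: list[str], k: int = 1) -> str:
    """Prepend retrieved documents so the generation is grounded in them."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    # Hypothetical two-document corpus, used only for illustration.
    corpus = [
        "Paris is the capital of France.",
        "The USMLE assesses medical competency.",
    ]
    print(build_prompt("What is the capital of France?", corpus))
```

The point of the sketch is the ordering: retrieval happens before generation, and the model is asked to answer from the retrieved context rather than from its parametric memory alone.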