FActScore: A metric that decomposes text into atomic facts and reports the fraction supported by a knowledge source; originally designed for biography generation
SAFE: Search-Augmented Factuality Evaluator—a metric that uses LLMs to decompose text and verify claims using Google Search
atomic claims: Short statements containing a single piece of information, used as the unit of verification in factuality metrics
verifiable claims: Claims describing a single event or state with necessary modifiers that can plausibly be proven true or false, excluding subjective opinions or metaphors
LFQA: Long-Form Question Answering—tasks requiring detailed, multi-sentence responses
F1@K: The harmonic mean of factual precision (supported claims / total claims) and recall (supported claims / K, capped at 1), where K is the median number of claims in model responses
open-weight models: Models whose weights are publicly released (e.g., Llama-3, Mixtral), allowing local execution and fine-tuning
sliding window: A technique using surrounding sentences as context during extraction to resolve references (like pronouns) without rewriting
Spearman correlation: A statistical measure of rank correlation, used here to compare automatic metric rankings with human judgment rankings
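
The F1@K entry above packs a small calculation into one line, so a minimal Python sketch may help. The function name `f1_at_k` and the example claim counts are hypothetical, chosen only for illustration. Capping recall at 1 follows the original SAFE definition of F1@K, and returning 0 when no claim is supported is the usual convention, assumed here.

```python
from statistics import median


def f1_at_k(supported: int, total: int, k: float) -> float:
    """Illustrative F1@K for a single response.

    supported: number of extracted claims judged supported
    total:     number of extracted claims in the response
    k:         target number of claims (here, a median over responses)
    """
    # Convention assumed here: a response with no supported claims scores 0.
    if supported <= 0 or total <= 0 or k <= 0:
        return 0.0
    precision = supported / total
    # Recall is capped at 1 so that claims beyond K earn no extra credit.
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Hypothetical claim counts from three model responses; K is their median.
claim_counts = [12, 16, 20]
k = median(claim_counts)  # 16

score = f1_at_k(supported=8, total=10, k=k)  # precision 0.8, recall 0.5
```

In this made-up example, a response with 8 supported claims out of 10 gets precision 0.8 but recall only 0.5 against K = 16, so its F1@K is about 0.62. A short, fully accurate response is therefore still penalized for covering few claims.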