LongFact: A newly proposed prompt set comprising 2,280 questions across 38 topics designed to elicit long-form, fact-heavy responses
SAFE: Search-Augmented Factuality Evaluator—an LLM agent workflow that decomposes text into facts and verifies them using Google Search
F1@K: A metric for long-form factuality that combines precision (the fraction of facts in a response that are supported) with recall (the number of supported facts relative to a target number K, capped at 1)
FActScore: A prior metric/framework for evaluating long-form factuality by breaking text into atomic facts and verifying them against Wikipedia
atomic fact: A single, self-contained piece of information extracted from a longer sentence (e.g., 'The Eiffel Tower is in Paris')
hallucination: When a model generates content that is factually incorrect or nonsensical, whether judged against its own internal knowledge or against external reality
LLM agent: An LLM setup that can use tools (like Google Search) and perform multi-step reasoning to complete a task
SFT: Supervised Fine-Tuning—training a model on labeled examples
RLHF: Reinforcement Learning from Human Feedback—a training method to align models with human preferences
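
To make the F1@K entry concrete, here is a minimal Python sketch of how the score could be computed from fact counts. The function name `f1_at_k` and its parameter names are hypothetical, chosen for illustration. The sketch assumes precision and recall are combined as a harmonic mean, that recall is capped at 1 once K supported facts are reached, and that a response with no supported facts scores 0.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Illustrative F1@K from counts of supported and not-supported facts.

    Names are hypothetical; `k` is the target number of supported facts.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if supported < 0 or not_supported < 0:
        raise ValueError("fact counts must be non-negative")

    # Assumption: a response with no supported facts scores 0.
    if supported == 0:
        return 0.0

    # Precision: share of the response's facts that are supported.
    precision = supported / (supported + not_supported)

    # Recall: supported facts relative to the target K, capped at 1.
    recall = min(supported / k, 1.0)

    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # 32 supported facts, none unsupported, target K = 64.
    print(f1_at_k(supported=32, not_supported=0, k=64))
```

In the usage line, precision is 1 and recall is 0.5, so the score is about 0.667: a fully accurate response is still penalized for falling short of the target K. Supplying more than K supported facts does not raise recall above 1.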