FACTOR: Factual Assessment via Corpus TransfORmation—the proposed framework for generating factuality benchmarks
IC-RALM: In-Context Retrieval-Augmented Language Models—augmenting LMs by prepending retrieved documents to the context without training
Perplexity: A measure of how well a probability model predicts a sample (the exponentiated average negative log-likelihood per token); often used as a proxy for LM quality but shown here to correlate only imperfectly with factuality
Edit-distance: A measure of dissimilarity between two strings (e.g., Levenshtein distance), used here to ensure false completions are similar to true ones
NLI: Natural Language Inference, the task of deciding whether a premise entails, contradicts, or is neutral toward a hypothesis; used here to filter generated contradictions
The Pile: A large-scale, diverse dataset for language modeling; the Wikipedia validation split is used for Wiki-FACTOR
InstructGPT: A model fine-tuned with human feedback (RLHF); used here as the generator for non-factual contradictions
Semantic frame error: Errors involving the main predicate or arguments (Entity, Predicate, Circumstance)
Discourse error: Errors involving relationships between sentences (Coreference, Link)
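
Two of the terms above, Perplexity and Edit-distance, are simple enough to illustrate directly. The following is a minimal Python sketch, not the implementation used by the FACTOR authors: the function names `perplexity` and `levenshtein` are hypothetical, the log-probabilities are assumed to be natural logs, and the edit distance assumes unit costs at the character level (the source does not state which granularity or cost scheme is used).

```python
import math


def perplexity(token_logprobs):
    """Exponentiated mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def levenshtein(a, b):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Illustrative strings only (not taken from the benchmark): a true completion
# and a false one that differs by a single character.
true_completion = "was born in 1950"
false_completion = "was born in 1960"
distance = levenshtein(true_completion, false_completion)

# A model that assigns probability 0.25 to each of four tokens.
ppl = perplexity([math.log(0.25)] * 4)
```

A small distance between a true completion and its false variant is what keeps the two superficially similar, so the comparison turns on the factual content rather than on surface differences.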