FActScore: A metric that breaks long-form text into atomic claims, verifies each claim against a knowledge source, and reports the percentage of claims that are supported
WildChat: A dataset of 1 million real-world user-chatbot interactions used as the source for mining entities
atomic claim: A single, indivisible piece of information extracted from a longer text (e.g., 'Obama was born in Hawaii') used for precise verification
perplexity: A measurement of how surprised a model is by a sequence of text; used here to approximate how rare an entity is (higher perplexity = rarer entity)
RAG: Retrieval-Augmented Generation; AI systems that answer questions by first retrieving relevant documents and then generating a response conditioned on them
WildFActScore-Strict: A stricter metric defined in this paper that assigns a score of 1 only if ALL atomic facts are correct, and 0 if ANY fact is wrong or the model abstains
WildFActScore: The percentage of atomic facts in a response supported by the knowledge source, averaged only over non-abstaining responses
abstention: When a model refuses to answer a query (e.g., 'I don't know about this person') rather than generating potentially false information
Google Custom Search JSON API: A tool used to programmatically retrieve search results from Google, used here to build knowledge sources for entities
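
The WildFActScore and WildFActScore-Strict entries above differ mainly in how they treat abstentions and partially correct responses. The sketch below is one way to read those two definitions, not the paper's implementation. The `Response` class and both function names are hypothetical, each atomic fact is assumed to be already labeled as supported or not, and two choices are assumptions: the strict score is averaged over all responses, and a response with no extracted facts is treated like an abstention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    """Hypothetical container for one model response.

    supported holds one bool per atomic fact (True = supported by the
    knowledge source). None marks an abstention.
    """
    supported: Optional[List[bool]]


def wild_factscore(responses: List[Response]) -> Optional[float]:
    """Mean fraction of supported facts over non-abstaining responses."""
    # Assumption: responses with no extracted facts are skipped like abstentions.
    answered = [r for r in responses if r.supported]
    if not answered:
        return None  # undefined when every response abstains
    per_response = [sum(r.supported) / len(r.supported) for r in answered]
    return sum(per_response) / len(per_response)


def wild_factscore_strict(responses: List[Response]) -> Optional[float]:
    """Fraction of responses whose atomic facts are all supported.

    A response scores 0 if any fact is unsupported or the model abstains.
    Assumption: the average runs over all responses, abstentions included.
    """
    if not responses:
        return None
    per_response = [
        1.0 if r.supported and all(r.supported) else 0.0 for r in responses
    ]
    return sum(per_response) / len(per_response)


responses = [
    Response([True, True]),                # fully supported
    Response([True, False, False, True]),  # half supported
    Response(None),                        # abstention
]
print(wild_factscore(responses))         # 0.75
print(wild_factscore_strict(responses))  # 0.3333333333333333
```

In this toy input the abstention is ignored by the first metric but counts as a failure under the strict one, which is why the strict score is lower.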