VeriScore: A high-quality but slow factuality evaluation pipeline that decomposes text into atomic claims and verifies them individually against Google Search results
atomic claims: Short, single-fact statements extracted from a longer text, containing one event or state
factual precision: The proportion of verifiable claims in a response that are supported by evidence
factual recall: The number of supported claims relative to the median number of claims across responses (penalizing verbosity or reticence)
F1@K: A harmonic mean of factual precision and recall used to score model factuality
SERPER API: A search engine API (Google Search wrapper) used to retrieve evidence snippets
Pearson correlation: A statistical measure (r) of linear correlation between two sets of data (here, agreement between evaluator scores)
distillation: The process of training a smaller or faster model (student) to replicate the behavior of a larger or more complex system (teacher)