NLI: Natural Language Inference, the task of determining whether a hypothesis is entailed by, contradicts, or is neutral with respect to a premise.
ROUGE: Recall-Oriented Understudy for Gisting Evaluation, a set of metrics measuring n-gram overlap (e.g., unigrams, bigrams, or the longest common subsequence) between generated text and a reference summary.
SummaC: A zero-shot, NLI-based consistency scoring system originally designed for summarization, used here to detect hallucinations by checking sentence-level entailment.
Reference-based metrics: Evaluation methods that compare generated text against a gold-standard ground truth (e.g., a Wikipedia article).
Pairwise metrics: Evaluation methods that compare a generated text against other samples generated by the same model to check for self-consistency.
Atomic facts: The smallest indivisible units of information in a sentence (e.g., 'Obama was born in Hawaii' contains facts about the person, action, and location).
Verifiable hallucination: Generated content that can be explicitly proven true or false based on the reference text.
Unverifiable hallucination: Generated content that is not present in the reference text, making it impossible to prove or disprove using that source alone.
High-resource languages: Languages with abundant training data available (e.g., English, Chinese, French).
Low-resource languages: Languages with limited training data available (e.g., Ukrainian, Persian).
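
The difference between reference-based and pairwise metrics can be illustrated with a minimal sketch built on a simplified ROUGE-1 recall. The function names, the lowercased whitespace tokenization, and the averaging over samples are assumptions made for illustration only; real ROUGE implementations add stemming and further variants, and this is not the scoring code of any particular library or paper.

```python
from collections import Counter
from typing import Sequence


def rouge1_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 recall: share of reference unigrams found in the candidate.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    # Clip each match at the number of times the token occurs in the candidate.
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / sum(ref_counts.values())


def reference_based_score(generated: str, reference: str) -> float:
    """Reference-based: compare the generation against a gold-standard text."""
    return rouge1_recall(generated, reference)


def pairwise_score(generated: str, samples: Sequence[str]) -> float:
    """Pairwise: mean overlap with other samples drawn from the same model."""
    if not samples:
        return 0.0
    return sum(rouge1_recall(generated, s) for s in samples) / len(samples)
```

For example, `reference_based_score("Obama was born in Hawaii", wiki_text)` needs a gold reference such as a Wikipedia article (`wiki_text` is a hypothetical variable), whereas `pairwise_score` needs only additional generations from the same model, so it can be computed even when no reference exists.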