factuality hallucination: Generating content that conflicts with established real-world knowledge
faithfulness hallucination: Generating content that conflicts with or is not supported by the provided source context/knowledge
spontaneous hallucination: Hallucinations naturally produced by an LLM when attempting to answer a query
induced hallucination: Hallucinations produced when an LLM is explicitly guided or tricked (e.g., by malicious instructions) into generating false information
HalluJudge: The specialized judge language model developed in this paper, fine-tuned on HalluDial to detect and explain hallucinations
LLM-as-a-Judge: Using a powerful Large Language Model, rather than human annotators, to evaluate the outputs of other models
ROUGE-L: A metric measuring text overlap based on the longest common subsequence, used here to check if the model identifies the correct hallucinated span
Cohen's Kappa: A statistical measure of agreement between two annotators on categorical labels, corrected for the agreement expected by chance
Macro F1: An average of F1 scores calculated for each class (hallucinated vs. non-hallucinated), treating all classes equally
BERTScore: An automatic evaluation metric that matches each token in the candidate sentence to tokens in the reference sentence by the cosine similarity of their contextual embeddings, then aggregates the matches into precision, recall, and F1 scores
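
The three count-based metrics defined above (ROUGE-L, Cohen's Kappa, Macro F1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used in the paper: the function names are hypothetical, tokenization is assumed to be whitespace splitting, and ROUGE-L is reported as the balanced F-measure of LCS precision and recall, whereas published implementations may weight recall differently.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced F-measure over LCS precision and recall (assumption: beta = 1)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: 1 = hallucinated, 0 = non-hallucinated.
    gold = [1, 1, 0, 0]
    pred = [1, 0, 0, 0]
    print(rouge_l_f1("the capital is Lyon", "the capital of France is Lyon"))
    print(cohens_kappa(gold, pred))
    print(macro_f1(gold, pred))
```

Because Macro F1 averages the per-class scores without weighting by class size, a detector that does well only on the majority class is penalized more than it would be under accuracy. BERTScore is left out of the sketch because it needs a pretrained encoder to produce the contextual embeddings.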