RAG: Retrieval-Augmented Generation, an approach in which an AI system answers a question by first retrieving relevant documents and then generating a response conditioned on them
Hallucination: Generated content that is nonsensical or unfaithful to the provided source content (in the context of RAG)
Faithfulness: The degree to which a generated response is strictly supported by the provided source context
LLM-as-a-judge: Using a Large Language Model to evaluate the quality or correctness of outputs from other models
Zero-shot: Asking a model to perform a task without providing any worked examples (demonstrations) of that task in the prompt
Chain-of-Thought (CoT): Prompting technique where the model is asked to generate intermediate reasoning steps before the final answer
NLI: Natural Language Inference—determining if a hypothesis is entailed by, contradicts, or is neutral to a premise
HHEM: Hughes Hallucination Evaluation Model—a specific hallucination detection model developed by Vectara
F1-macro: The arithmetic mean of F1 scores calculated for each class (e.g., Consistent and Hallucinated), treating all classes equally
Balanced Accuracy: The arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate), useful for imbalanced datasets
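The two metric definitions above (F1-macro and Balanced Accuracy) can be made concrete with a small sketch. It is only illustrative: the function names are hypothetical, the labels reuse the glossary's example classes, treating "Hallucinated" as the positive class is an assumption, and the toy labels are made-up inputs rather than results from any evaluation.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_macro(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str],
                      positive: str) -> float:
    """Mean of sensitivity (TPR) and specificity (TNR) for a binary task."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Toy, imbalanced example (illustrative data only): 8 Consistent, 2 Hallucinated.
y_true = ["Consistent"] * 8 + ["Hallucinated"] * 2
# The hypothetical detector misses one of the two hallucinations.
y_pred = ["Consistent"] * 9 + ["Hallucinated"]

macro = f1_macro(y_true, y_pred, ["Consistent", "Hallucinated"])
bal_acc = balanced_accuracy(y_true, y_pred, positive="Hallucinated")
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(macro, 3), bal_acc, plain_acc)  # 0.804 0.75 0.9
```

On this toy input plain accuracy is 0.9 even though half of the hallucinations are missed, while F1-macro (about 0.804) and balanced accuracy (0.75) both drop, which is why these metrics are preferred when one class is rare.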