RAG: Retrieval-Augmented Generation—AI systems that answer questions by first retrieving relevant documents and then generating an answer grounded in them
LLM-as-a-judge: Using a strong language model to evaluate the quality of another model's outputs; such judgments often correlate well with human judgment
Multi-hop QA: Questions that require combining information from multiple different documents or passages to answer correctly
Hallucination: When an LLM generates information that is factually incorrect or not supported by the retrieved context
MRR: Mean Reciprocal Rank—a statistical measure for evaluating any process that produces a ranked list of possible responses, computed as the average over queries of the reciprocal of the rank of the first correct answer
NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that takes into account the position of relevant items in the retrieved list
SAS: Semantic Answer Similarity—a metric using cross-encoders to evaluate the semantic alignment between a generated answer and a reference answer
BERTScore: A metric that compares each token in the candidate sentence with each token in the reference sentence using contextual embeddings, then aggregates the best matches into precision, recall, and F1 scores
CKA: Centered Kernel Alignment—a similarity index for comparing the representations (embeddings) produced by different models
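
To make the two ranking metrics above concrete, here is a minimal sketch of MRR and NDCG in plain Python. The function names and input layout are illustrative assumptions, not taken from any particular library. The NDCG variant shown uses linear gains and computes the ideal ordering from the retrieved list only; other formulations (exponential gains, a cutoff at k, ideal ordering over all known relevant items) are also common.

```python
import math

def mean_reciprocal_rank(ranked_flags):
    """MRR over queries.

    ranked_flags: one list per query of 0/1 relevance flags in rank order
    (hypothetical input layout chosen for this sketch).
    A query with no relevant result contributes 0.
    """
    total = 0.0
    for flags in ranked_flags:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank  # only the first correct answer counts
                break
    return total / len(ranked_flags)

def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the same gains sorted best-first (assumption:
    the ideal ranking is built from the retrieved items only)."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

For example, with three queries whose first relevant results sit at ranks 2, 1, and nowhere, the reciprocal ranks are 0.5, 1, and 0, so the MRR is 0.5. A list whose graded relevances are already sorted best-first has an NDCG of 1.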