RAG: Retrieval-Augmented Generation, AI systems that answer questions by first retrieving relevant documents and then conditioning the generated answer on them
LLM-as-a-judge: Using a strong Large Language Model, in place of human annotators, to evaluate the outputs of other models
BioASQ: A large-scale biomedical semantic indexing and question answering challenge/dataset
Zero-shot: The model performs the task (evaluation) without being trained on specific examples of that task
Claim extraction: An intermediate step in some evaluation pipelines where complex sentences are broken down into atomic assertions
NLI: Natural Language Inference, the task of determining whether a hypothesis follows from a premise (entailment), conflicts with it (contradiction), or is neither supported nor refuted by it (neutral)
BLEU: Bilingual Evaluation Understudy, a precision-oriented metric measuring n-gram overlap between generated text and reference text, with a penalty for overly short outputs
ROUGE: Recall-Oriented Understudy for Gisting Evaluation, a family of metrics measuring overlap between generated text and a reference, via n-grams (ROUGE-N) or Longest Common Subsequence (ROUGE-L)
BERTScore: A metric computing semantic similarity using contextual embeddings rather than exact word matching
Exact Match (EM): A strict metric where the generated answer must be character-for-character identical to the ground truth (often after light normalization such as lowercasing and stripping punctuation)
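Prediction-Powered Inference (PPI): A statistical technique that combines a small set of gold-standard labels with a large set of model predictions (such as scores from an LLM judge) to estimate population properties, using the labeled set to correct the bias in the predictions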
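
To make the PPI entry concrete, here is a minimal sketch of the prediction-powered point estimate of a mean, assuming binary correctness scores. The function name `ppi_mean` and the toy scores are hypothetical and not taken from the source. The estimate is the judge's average on the large unlabeled pool plus a correction term: the average gap between human and judge scores on the small labeled set.

```python
from statistics import fmean


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Prediction-powered point estimate of a population mean.

    judge_unlabeled: judge scores on the large pool with no human labels
    judge_labeled:   judge scores on the small human-labeled subset
    human_labeled:   human scores on that same subset, in the same order
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge and human scores must be paired")
    # Judge-only estimate; biased if the judge is systematically off.
    judge_only = fmean(judge_unlabeled)
    # Rectifier: average human-minus-judge gap on the labeled subset.
    rectifier = fmean([h - j for h, j in zip(human_labeled, judge_labeled)])
    return judge_only + rectifier


# Hypothetical toy data: 1 = answer judged correct, 0 = incorrect.
judge_pool = [1, 1, 1, 0, 1, 1, 0, 1]   # judge-only accuracy: 0.75
judge_small = [1, 1, 0, 1]
human_small = [1, 0, 0, 1]              # judge was too lenient once in four
print(ppi_mean(judge_pool, judge_small, human_small))  # 0.5
```

Here the judge alone reports 0.75, but on the labeled subset it overrates answers by 0.25 on average, so the corrected estimate is 0.5. A full PPI analysis would also produce a confidence interval, which this sketch omits.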