RAG: Retrieval-Augmented Generation, AI systems that answer questions by first retrieving relevant documents and then generating an answer grounded in them
Support: A metric evaluating whether the information in a generated sentence is factually backed up by its cited document
LLM-as-a-judge: Using a Large Language Model (like GPT-4) to evaluate the quality of outputs from other models
Kendall's tau: A statistic used to measure the ordinal association between two measured quantities (e.g., how similarly two judges rank a list of systems)
Cohen's kappa: A statistic that measures inter-annotator agreement for qualitative items, accounting for the possibility of the agreement occurring by chance
weighted precision: A metric in this paper measuring the proportion of citations that support the answer, weighted by support level (1.0 for Full, 0.5 for Partial)
weighted recall: A metric in this paper measuring the proportion of answer sentences supported by citations, weighted by support level
TREC: Text REtrieval Conference, a series of evaluation workshops organized into tracks, each focusing on a different information retrieval research area
post-editing: An annotation workflow where humans review and correct pre-generated labels rather than creating them from scratch
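
The four quantitative terms above (weighted precision, weighted recall, Kendall's tau, Cohen's kappa) can be made concrete with a minimal Python sketch. It rests on assumptions that are not stated in the glossary: each answer sentence carries at most one judged citation, an uncited sentence is represented by None, and any label other than Full or Partial gets weight 0.0. The function names and the label strings are hypothetical, and the tau shown is the simple tau-a variant, which does not correct for ties. It is an illustration of the definitions, not the paper's evaluation code.

```python
from collections import Counter
from itertools import combinations
from typing import Optional, Sequence

# Weights from the glossary: Full = 1.0, Partial = 0.5.
# Mapping "none" to 0.0 is an assumption of this sketch.
SUPPORT_WEIGHTS = {"full": 1.0, "partial": 0.5, "none": 0.0}


def weighted_precision(labels: Sequence[Optional[str]]) -> float:
    """Support weight summed over cited sentences, divided by the number of cited sentences."""
    cited = [label for label in labels if label is not None]
    if not cited:
        return 0.0
    return sum(SUPPORT_WEIGHTS[label] for label in cited) / len(cited)


def weighted_recall(labels: Sequence[Optional[str]]) -> float:
    """Support weight summed over cited sentences, divided by the number of all sentences."""
    if not labels:
        return 0.0
    total = sum(SUPPORT_WEIGHTS[label] for label in labels if label is not None)
    return total / len(labels)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs (needs at least 2 items)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Four answer sentences; the third has no citation.
sentence_labels = ["full", "partial", None, "none"]
precision = weighted_precision(sentence_labels)
recall = weighted_recall(sentence_labels)

# Two judges scoring the same four systems.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])

# Two annotators labeling the same four citations.
kappa = cohen_kappa(
    ["full", "full", "partial", "none"],
    ["full", "partial", "partial", "none"],
)
```

In this toy input the three cited sentences contribute 1.0 + 0.5 + 0.0 = 1.5, so precision is 1.5 / 3 = 0.5 and recall is 1.5 / 4 = 0.375. For rankings with ties or for production use, an established implementation of these statistics is likely a better choice than this sketch.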