RAG-QA: Retrieval-Augmented Generation for Question Answering, i.e., systems that retrieve documents and generate a free-form answer
LFRQA: Long-Form RobustQA, the new dataset proposed in this paper, which provides coherent long-form answers
RobustQA: A prior dataset containing short, extractive answer spans for RAG tasks
Elo rating: A ranking system originally for chess, used here to rank LLM performance based on pairwise win rates
CoT: Chain-of-Thought—a prompting technique encouraging models to 'think' step-by-step before answering
BioASQ: A biomedical semantic indexing and question answering challenge/dataset
ColBERTv2: A neural retrieval model that scores query-passage pairs through late interaction over token-level embeddings
Pearson Correlation: A statistical measure of linear correlation between two sets of data (here, human vs. model scores)
Cohen's Kappa: A statistic that measures agreement between two annotators on categorical items, corrected for agreement expected by chance
F1 score: A metric balancing precision and recall, traditionally used for token overlap in extractive QA
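
Two of the metrics defined above, token-overlap F1 and Elo rating, are easier to follow with a small worked sketch. The Python below is illustrative only and is not the paper's evaluation code: the function names are hypothetical, the tokenizer is plain lowercase whitespace splitting (standard extractive-QA scoring usually also strips punctuation and articles), and the Elo constants (scale 400, K = 32) are the conventional chess defaults, assumed here because this glossary does not state the values the paper uses.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercase whitespace tokenization, a simplification.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a shared token counts as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k = 32 is an assumed default, not a value taken from the paper.
    """
    delta = k * (score_a - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(token_f1("the cat sat", "the cat"))
    print(elo_update(1000.0, 1000.0, 1.0))
```

Token F1 rewards word overlap with a reference span, which suits short extractive answers, while Elo needs only pairwise preferences between two systems' answers, which is why it can rank models on free-form output.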