Cranfield paradigm: A standard evaluation framework from Information Retrieval in which, for a fixed set of documents and queries, candidate results are pooled and the pool is exhaustively judged for relevance to create a reusable test collection
Pooling: The process of collecting the top-k results from multiple diverse systems to form a candidate set for relevance judgment, ensuring high coverage of likely relevant items
Exposure bias: The tendency of historical data to reflect only what users were shown by previous systems, not what they might have liked if they had seen it
MNAR: Missing Not At Random—the pattern where missing ratings in a dataset are not random but reflect user choices (e.g., users only rate items they chose to consume)
Kendall's τ: A rank correlation statistic that measures the ordinal association between two rankings (here, the rankings of recommender systems produced by different judges)
Judged@100: The percentage of items in the top-100 recommendations that have a corresponding relevance label in the ground truth
Compatibility: An evaluation measure that handles graded relevance and models user persistence, used here as the primary effectiveness metric
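
Three of the terms above (Pooling, Judged@100, and Kendall's τ) are procedural enough that a small sketch may help. The Python below is illustrative only: the function names `build_pool`, `judged_at_k`, and `kendall_tau`, the toy system names, and all scores are assumptions made for this example and are not taken from the source. The τ shown is the simple tau-a variant with no tie correction, which may differ from the variant used in the original evaluation.

```python
from itertools import combinations


def build_pool(runs, k):
    """Pooling: union of the top-k items from each system's ranked list."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


def judged_at_k(ranking, judged_items, k=100):
    """Judged@k: percentage of the top-k items that carry a relevance label."""
    top = ranking[:k]
    if not top:
        return 0.0
    return 100.0 * sum(item in judged_items for item in top) / len(top)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over the systems scored by both judges (no tie correction)."""
    systems = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both judges order the pair the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical toy data: three systems, each returning a ranked list of item ids.
runs = {
    "sys_a": ["i1", "i2", "i3", "i4"],
    "sys_b": ["i2", "i5", "i1", "i6"],
    "sys_c": ["i7", "i1", "i2", "i3"],
}

# Pool the top-2 of every system; assume every pooled item then gets a label.
pool = build_pool(runs, k=2)

# Share of sys_a's top-4 that falls inside the judged pool.
coverage = judged_at_k(runs["sys_a"], pool, k=4)

# Hypothetical effectiveness scores assigned to each system by two judges.
judge_1 = {"sys_a": 0.62, "sys_b": 0.48, "sys_c": 0.55}
judge_2 = {"sys_a": 0.70, "sys_b": 0.52, "sys_c": 0.50}
tau = kendall_tau(judge_1, judge_2)
```

With this toy data the two judges agree on two of the three system pairs and disagree on one, so τ comes out at one third. A τ of 1 would mean the judges rank the systems identically, and -1 would mean fully reversed rankings.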