CIR: Complementary-Item Recommendation, i.e., systems that suggest add-on items (e.g., a case for a phone) rather than substitutes
ScalingEval: The proposed framework for multi-agent, consensus-based evaluation of recommendation pairs
LLM-as-a-judge: Using Large Language Models to evaluate the quality of outputs from other systems, replacing human annotators
Majority Voting: A consensus mechanism where the final label is determined by the most frequent prediction among multiple models
Anchor-Recommendation Pair: A tuple consisting of a base product (anchor) and a suggested add-on product (recommendation) being evaluated
Conflict-Resolution Policy: A set of rules (Reject >> Major >> Minor >> Good) used to determine the final label when models disagree; more severe labels take precedence, which prioritizes safety/rejection
Agreement Rate: The percentage of models that align with the majority decision, used as a proxy for the 'difficulty' or ambiguity of a specific test case
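
To make the Majority Voting, Conflict-Resolution Policy, and Agreement Rate entries concrete, here is a minimal Python sketch of how the three could fit together for a single anchor-recommendation pair. It is an illustration under stated assumptions, not the framework's actual implementation: the function name `consensus` and the constant `SEVERITY` are hypothetical, and the sketch assumes the severity order is applied only to break ties in the vote, which the definitions above do not spell out.

```python
from collections import Counter

# Severity order taken from the glossary: Reject >> Major >> Minor >> Good.
SEVERITY = ["Reject", "Major", "Minor", "Good"]


def consensus(labels):
    """Return (final_label, agreement_rate) for one anchor-recommendation pair.

    `labels` holds one predicted label per judge model.
    """
    if not labels:
        raise ValueError("need at least one label")
    unknown = set(labels) - set(SEVERITY)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(labels)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]

    # Assumption: when several labels share the top count, the most severe
    # one wins (lowest index in SEVERITY).
    final = min(tied, key=SEVERITY.index)

    # Agreement rate: fraction of models whose label matches the final label.
    return final, counts[final] / len(labels)


# A 2-2 split is resolved toward the more severe label.
print(consensus(["Reject", "Good", "Reject", "Good"]))  # ('Reject', 0.5)
```

A low agreement rate, such as the 0.5 in the example, would mark the pair as a harder or more ambiguous case than one on which every model returns the same label.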