NDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality that discounts each item's relevance by its position in the list and normalizes by the score of the ideal ordering, so values lie between 0 and 1
Plackett-Luce model: A probabilistic model that defines a distribution over rankings based on item scores; sampling permutations from it gives each ranking a differentiable log-probability, so ranking can be optimized with gradient-based methods
Position bias: The tendency of LLMs to alter their output based on the order of inputs (e.g., favoring items at the start of a list)
Pointwise inference: Evaluating items one by one independently rather than all at once, used here to prevent position bias
PPO: Proximal Policy Optimization, an RL algorithm that stabilizes policy updates by clipping the probability ratio between the new and old policies in its surrogate objective
SFT: Supervised Fine-Tuning—training the model on labeled examples (here, synthesized reasoning traces) before RL optimization
REINFORCE: A basic policy gradient algorithm used here to optimize the scoring head
DeepSeek-R1: A reasoning-focused Large Language Model used in this paper as a teacher to synthesize training data for the SFT stage
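
To make the NDCG and Plackett-Luce entries above concrete, here is a minimal sketch in plain Python. It is illustrative only and not the paper's implementation: the function names (`dcg`, `ndcg`, `sample_plackett_luce`) are hypothetical, and the exponential gain `2**rel - 1` is an assumed convention (a linear gain is also common).

```python
import math
import random


def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order.

    Assumption: exponential gain (2**rel - 1) with a log2(rank + 1) discount,
    where rank starts at 1.
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def sample_plackett_luce(scores, rng):
    """Sample a ranking by repeatedly drawing one item from the remaining ones
    with probability proportional to exp(score).

    Returns the ranking (a list of item indices) and its log-probability.
    """
    remaining = list(range(len(scores)))
    ranking = []
    log_prob = 0.0
    while remaining:
        # Subtract the max score before exponentiating for numerical stability.
        top = max(scores[i] for i in remaining)
        weights = [math.exp(scores[i] - top) for i in remaining]
        total = sum(weights)
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        log_prob += math.log(weights[pick] / total)
        ranking.append(remaining.pop(pick))
    return ranking, log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [0, 3, 1, 2]            # hypothetical relevance labels per item
    scores = [0.1, 2.0, 0.5, 1.2]    # hypothetical model scores per item
    ranking, log_prob = sample_plackett_luce(scores, rng)
    reward = ndcg([labels[i] for i in ranking])
    print(ranking, round(log_prob, 3), round(reward, 3))
```

In a REINFORCE-style setup, the NDCG of a sampled ranking would serve as the reward that weights the gradient of `log_prob` with respect to the scores; the sketch stops short of that step because it uses no autodiff library.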