GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, avoiding a separate value model
RLVR: Reinforcement Learning from Verifiable Reward—alignment using objective, programmatic rewards (like correct formatting or catalog inclusion) rather than human preference models
SFT: Supervised Fine-Tuning—training the model on labeled demonstrations to establish initial capabilities before RL
NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that gives more weight to correct items appearing earlier in the list
DCG: Discounted Cumulative Gain—the non-normalized sum of relevance scores discounted by their rank position
Credit Assignment: The problem of determining which past actions are responsible for a received reward
Out-of-Catalog (OOC): Items generated by the model that do not exist in the system's valid item database
Behavior Cloning: Learning a policy by supervising it to mimic expert demonstrations (synonymous here with SFT)
KL Divergence: A measure of how much one probability distribution differs from another (asymmetric, so not a true distance), used here as a penalty to keep the RL policy from drifting too far from the SFT starting point
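
Three of the definitions above (DCG, NDCG, and the group-relative advantage in GRPO) can be made concrete with a small sketch. The function names are hypothetical and chosen for illustration only. The sketch assumes the common log2 rank discount for DCG, normalizes NDCG against the ideal ordering of the same list, and standardizes rewards within a group by their mean and standard deviation, as GRPO is usually described. The source may use different variants.

```python
import math
from statistics import mean, pstdev


def dcg(relevances):
    """Discounted Cumulative Gain: each relevance divided by log2(rank + 1), rank starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering of the same relevances.

    Assumption: the ideal list is built from the given relevances only.
    """
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each output relative to its group: (reward - group mean) / (group std + eps).

    `rewards` holds the scalar rewards of several outputs sampled from the same input.
    No value model is involved; the group itself serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, `ndcg([0, 0, 1])` is lower than `ndcg([1, 0, 0])` because the single relevant item appears at rank 3 rather than rank 1, and `group_relative_advantages([1, 0, 1, 0])` assigns positive advantages to the rewarded outputs and negative advantages to the others.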