DPO: Direct Preference Optimization—an alignment method optimizing a policy to prefer chosen answers over rejected ones without a separate reward model
SFT: Supervised Fine-Tuning—training the model on positive examples (user history -> target item) using standard cross-entropy loss
Self-Play: A training mechanism where the model improves by interacting with its own previous versions; here, using its own outputs as negative feedback
MGU: Missing Group Utility—a fairness metric measuring the distribution mismatch between ground-truth user preferences and model recommendations
Reverse KL-divergence: A divergence measure minimized by DPO that encourages mode-seeking behavior (concentrating probability mass on the highest peaks of the target distribution), often leading to popularity bias
Forward KL-divergence: A divergence measure minimized by SFT that encourages mass-covering behavior (spreading probability mass over everything the target distribution supports), generally less prone to popularity bias than reverse KL
Filter bubble: A state where a recommender system isolates a user in a cultural or ideological bubble by showing only items they are already likely to agree with or know
LRS: LLM-based Recommendation System—using Large Language Models to perform recommendation tasks
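
The mode-seeking versus mass-covering distinction in the Reverse KL-divergence and Forward KL-divergence entries can be illustrated with a small numeric sketch. Everything here is an assumption made for illustration: the four-item distribution p, the two candidate model distributions q_peaked and q_spread, and the helper kl are hypothetical and are not taken from any particular DPO or SFT implementation. The sketch only compares which candidate each divergence scores lower.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions given as equal-length lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


# Hypothetical ground-truth preferences over four items:
# two strongly preferred items (first and last) and two rarely chosen ones.
p = [0.48, 0.02, 0.02, 0.48]

# Hypothetical candidate model distributions.
q_peaked = [0.94, 0.02, 0.02, 0.02]  # nearly all mass on a single peak
q_spread = [0.25, 0.25, 0.25, 0.25]  # mass spread evenly over every item

# Forward KL puts the target first: KL(p || q).
forward = {"peaked": kl(p, q_peaked), "spread": kl(p, q_spread)}
# Reverse KL puts the model first: KL(q || p).
reverse = {"peaked": kl(q_peaked, p), "spread": kl(q_spread, p)}

# Forward KL scores the spread candidate lower (mass-covering);
# reverse KL scores the peaked candidate lower (mode-seeking).
print("forward KL prefers:", min(forward, key=forward.get))
print("reverse KL prefers:", min(reverse, key=reverse.get))
```

In this toy setting the peaked candidate ignores one of the two preferred items, yet reverse KL still ranks it ahead of the spread candidate, which is the kind of collapse onto a few items that the glossary associates with popularity bias.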