verbalization: The process of converting structured data (like user logs) into natural language text for LLM input
GRPO: Group Relative Policy Optimization, an RL algorithm that estimates each output's advantage by normalizing its reward against the mean and standard deviation of a group of outputs sampled for the same prompt, avoiding the need for a separate value function
Recall@1 for Discovery: A metric measuring how often the model correctly predicts a relevant item that the user has not previously watched
Oracle Reasoner: A powerful, fixed LLM used during training to provide reward signals to the Verbalizer, ensuring the Verbalizer learns robust representations
plateau function: A reward function component that stays high while its input lies within a target range (e.g., a length ratio between 0.3 and 0.7) and penalizes values outside that range
cold-start items: Items with little or no historical interaction data, making them difficult for traditional collaborative filtering to recommend
heterogeneous feature space: Data containing mixed types of information (time, text, categorical IDs, continuous numbers) that can be difficult for models to process uniformly
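
As an illustration of two of the definitions above (the GRPO group-normalized advantage and the plateau function), here is a minimal Python sketch. It is not the authors' implementation. The function names, the `eps` stabilizer, the use of the population standard deviation, and the linear fall-off controlled by the hypothetical `margin` parameter are all assumptions made for the example; the glossary only states that rewards are normalized within a group and that values outside the range are penalized.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no learned
    value function is needed. Population std and eps are assumptions here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def plateau_reward(x, low=0.3, high=0.7, margin=0.3):
    """Return 1.0 inside [low, high]; decay linearly to 0.0 outside it.

    The linear decay and the `margin` width are assumed for illustration;
    the glossary does not specify the shape of the penalty.
    """
    if low <= x <= high:
        return 1.0
    distance = low - x if x < low else x - high
    return max(0.0, 1.0 - distance / margin)


if __name__ == "__main__":
    # Four sampled outputs for one prompt, scored by some reward signal.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Length ratios inside and outside the 0.3-0.7 plateau.
    print([plateau_reward(x) for x in (0.1, 0.5, 0.85)])
```

Outputs that score above their group's mean receive positive advantages and those below receive negative ones; when every reward in a group is identical, all advantages are zero and that group contributes no learning signal.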