RLPA: Reinforcement Learning for Personalized Alignment—the proposed framework using simulated users and dual rewards.
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker.
Cold-start: The scenario in which a system must serve a new user with no historical data or profile for that user.
SFT: Supervised Fine-Tuning—training a model on a fixed dataset of inputs and target outputs.
DPO: Direct Preference Optimization—a method to align models to preferences without a separate reward model, typically using static pairs of chosen/rejected responses.
ALOE: A benchmark for evaluating personalized dialogue systems, containing dialogues annotated with user profiles.
Slot-value format: A structured representation of information where specific categories (slots) are assigned specific contents (values).
PPO: Proximal Policy Optimization—a policy gradient RL algorithm that optimizes the model while preventing drastic updates that could destabilize training.
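
Example (slot-value format): as a rough illustration of the slot-value entry above, a user profile can be held as a mapping from slots to values, with unknown slots left empty, as they would all be in a cold-start setting. This is only a sketch under stated assumptions: the slot names (`age`, `occupation`, `hobby`) and the helpers `fill_slot` and `render` are hypothetical and are not taken from the source.

```python
from typing import Dict, Optional

# Hypothetical slots; the source does not list the actual slot names.
Profile = Dict[str, Optional[str]]

# A cold-start profile: every slot exists, but no value is known yet.
cold_start: Profile = {"age": None, "occupation": None, "hobby": None}


def fill_slot(profile: Profile, slot: str, value: str) -> Profile:
    """Return a copy of the profile with one slot set to a value."""
    updated = dict(profile)
    updated[slot] = value
    return updated


def render(profile: Profile) -> str:
    """Serialize filled slots as 'slot: value' lines, skipping empty slots."""
    return "\n".join(f"{k}: {v}" for k, v in profile.items() if v is not None)


# Slots are filled one at a time as information about the user becomes known.
profile = fill_slot(cold_start, "occupation", "nurse")
profile = fill_slot(profile, "hobby", "chess")
print(render(profile))
```

Returning a copy from `fill_slot` leaves the original cold-start profile unchanged, which makes it easy to compare a profile before and after an update.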