RLHF: Reinforcement Learning from Human Feedback—training AI using a reward signal derived from human data (or proxies) rather than just correct/incorrect labels
PPO: Proximal Policy Optimization—an RL algorithm that improves the model's policy in stable steps, preventing it from changing too drastically at once
Implicit Feedback: User signals that are not explicit ratings, such as time spent reading (dwell time), clicks, or tone of voice (sentiment)
CRS: Conversational Recommender Systems—AI that suggests items through natural language dialogue rather than static lists
NDCG: Normalized Discounted Cumulative Gain—a metric measuring ranking quality, giving higher scores to relevant items appearing at the top of the list
Hit Rate: The percentage of times the correct or relevant item appears in the top-K recommendations
SFT: Supervised Fine-Tuning—the initial training phase using standard labeled data before RLHF is applied
RoBERTa: A robustly optimized BERT pretraining approach—used here as a classifier to detect sentiment changes in text
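
The two ranking metrics defined above (NDCG and Hit Rate) can be illustrated with a minimal sketch. The function names, the linear gain, and the log2 position discount are assumptions chosen for illustration, not necessarily the exact variant used in the source. The ideal ordering is taken from the same relevance list, which assumes every relevant item appears somewhere in that list.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain, log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of cases whose target item appears in the top-k of its ranked list."""
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


# Hypothetical data: one relevance list, and two recommendation lists with one target each.
print(ndcg_at_k([0, 0, 1], k=3))
print(hit_rate_at_k([["a", "b", "c"], ["d", "e", "f"]], ["b", "f"], k=2))
```

Here `hit_rate_at_k` returns a fraction between 0 and 1; multiply by 100 to report it as a percentage.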