GFlowNet: Generative Flow Network—a probabilistic framework that trains a policy to sample objects with probability proportional to their reward, rather than just maximizing reward
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled data using maximum likelihood estimation
Process Reward: A reward signal provided at intermediate steps (e.g., per token) of generation, rather than only at the final outcome
DPO: Direct Preference Optimization—a method to align language models with preferences without a reward model, often used as a baseline here
Popularity Bias: The tendency of recommenders to suggest items that are globally frequent in the training data, ignoring personal relevance
Subtrajectory Balance: A GFlowNet loss function that enforces flow consistency across segments (subtrajectories) of a generation trajectory, so that the learned sampling probabilities become proportional to rewards
Prefix Tree: A data structure representing all item titles where common beginning tokens share the same path; used here to calculate flow
DGU: Difference in Group Usage—a fairness metric measuring the discrepancy between recommended item popularity groups and historical data groups
NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that weights correct recommendations higher if they appear earlier in the list
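
To make the Prefix Tree and Subtrajectory Balance entries concrete, here is a minimal sketch in Python. It assumes that the flow of a prefix is the summed reward of every item title that extends it, that each title is a tuple of tokens, and that no title is a strict prefix of another (for example because every title ends in an end-of-sequence token). The function names and the toy catalogue are hypothetical and only illustrate the idea; they are not taken from the source.

```python
import math
from collections import defaultdict


def build_prefix_flows(items):
    """Map every title prefix to the summed reward of the titles that extend it.

    `items` maps a tuple of title tokens to a non-negative reward (assumed format).
    """
    flows = defaultdict(float)
    for tokens, reward in items.items():
        for i in range(len(tokens) + 1):
            flows[tuple(tokens[:i])] += reward
    return dict(flows)


def next_token_probs(flows, prefix):
    """Flow-induced policy: P(token | prefix) = F(prefix + token) / F(prefix)."""
    prefix = tuple(prefix)
    parent_flow = flows.get(prefix, 0.0)
    if parent_flow <= 0:
        return {}
    return {
        p[-1]: f / parent_flow
        for p, f in flows.items()
        if len(p) == len(prefix) + 1 and p[:-1] == prefix
    }


def subtb_residual(log_flow_start, log_flow_end, log_pf_steps, log_pb_steps=None):
    """Squared subtrajectory-balance residual for one segment.

    In a tree-shaped state space each state has a single parent, so the
    backward probabilities are 1 (log 0.0) unless given explicitly.
    """
    if log_pb_steps is None:
        log_pb_steps = [0.0] * len(log_pf_steps)
    forward = log_flow_start + sum(log_pf_steps)
    backward = log_flow_end + sum(log_pb_steps)
    return (forward - backward) ** 2


# Toy catalogue (hypothetical rewards).
items = {("star", "wars"): 3.0, ("star", "trek"): 1.0, ("alien",): 4.0}
flows = build_prefix_flows(items)
p_root = next_token_probs(flows, ())
p_star = next_token_probs(flows, ("star",))

# Sampling token by token reaches "star wars" with probability 3/8,
# which is its reward divided by the total reward.
prob_star_wars = p_root["star"] * p_star["wars"]

# With these exact flows, the residual over the full path is (numerically) zero.
residual = subtb_residual(
    math.log(flows[()]),
    math.log(flows[("star", "wars")]),
    [math.log(p_root["star"]), math.log(p_star["wars"])],
)
```

In training, the flows and forward probabilities would come from a learned model rather than an exact tree, and the residual would be minimised over many segments; the sketch only shows why a zero residual coincides with reward-proportional sampling.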
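
As an illustration of the NDCG entry, the following Python sketch computes NDCG@k from a list of relevance scores given in ranked order. It assumes the common logarithmic discount 1 / log2(rank + 1) and takes the ideal ordering from the same list of scores; evaluation protocols differ in both choices, so treat the function names and conventions here as assumptions rather than the source's exact setup.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked positions."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same scores."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant item: placing it first scores 1.0, placing it second scores less.
hit_first = ndcg_at_k([1, 0, 0], k=3)   # 1.0
hit_second = ndcg_at_k([0, 1, 0], k=3)  # 1 / log2(3), about 0.631
```

With a single relevant item per user, as is typical in next-item recommendation, the score reduces to 1 / log2(rank + 1) when the item appears in the top k and 0 otherwise.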