SFT: Supervised Fine-Tuning—training a model to mimic a dataset of inputs and targets (e.g., mimicking GPT-4's query rewrites)
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates the baseline from a group of outputs sampled for the same input, removing the need for a separate value network
NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that considers the position of relevant items in a list
BM25: Best Matching 25—a standard bag-of-words retrieval function that ranks documents based on term frequency and inverse document frequency, with normalization for document length
Cold-start: A scenario where the system has little or no prior interaction data for a user or item
Transductive setting: Evaluation where test items were seen during training
Inductive setting: Evaluation where test items were NOT seen during training (testing generalization)
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a reference distribution (asymmetric, so not a true distance metric)
BLAIR: A dense retrieval model used as a baseline and backbone in the experiments
Cross-encoder: A ranking model that processes query and document simultaneously to output a relevance score (accurate but slow)
PPO: Proximal Policy Optimization—a popular reinforcement learning algorithm (mentioned as a contrast to GRPO)
IFEval: Instruction Following Evaluation—a benchmark measuring an LLM's ability to follow explicit constraints and formatting instructions
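
Two of the entries above (NDCG and GRPO) describe small computations that may be easier to follow in code. The sketch below is illustrative only and is not taken from the experiments: the function names are hypothetical, NDCG uses linear gain with the ideal ordering derived from the same list of judged items, and the GRPO baseline is shown in the common mean/standard-deviation normalized form, which is an assumption about the variant in use.

```python
import math
from statistics import mean, pstdev


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    # The item at 0-based position `rank` is discounted by log2(rank + 2),
    # so the first item is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output against its own group (assumed GRPO-style form).

    The group mean acts as the baseline, so no learned value network is needed.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    # Relevance labels in the order the system ranked the items.
    print(ndcg([0, 1], k=2))
    # Rewards for four outputs sampled from the same input.
    print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```

A ranking that is already in ideal order scores 1.0, and advantages within a group sum to approximately zero, which is what makes the group itself a usable baseline.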