InfoNCE: A contrastive loss function used to learn representations by pulling positive pairs together and pushing negative pairs apart in vector space
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that computes each response's advantage by normalizing its reward against the other responses sampled for the same prompt, which stabilizes training without a separate value network
embed_token: A special token added to the LLM's vocabulary; the hidden state at this token's position becomes the dense vector representation of the text
Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before producing a final answer
DCG: Discounted Cumulative Gain—a measure of ranking quality that gives more weight to relevant items appearing earlier in the result list
Supervised Fine-Tuning (SFT): The process of training a pre-trained model on a labeled dataset to adapt it to a specific task
KL divergence: Kullback-Leibler divergence, an asymmetric measure of how one probability distribution differs from a reference distribution (not a true distance metric), used here to prevent the model from drifting too far from its base behavior
Triplet Margin Loss: A loss function that penalizes the model unless the distance between a query and a positive document is smaller than the distance to a negative document by at least a fixed margin
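
Several of the terms above correspond to short formulas, and a minimal pure-Python sketch may make them more concrete. Everything here is illustrative. The function names are hypothetical. The log2 discount for DCG, the population standard deviation in the group normalization, and the default temperature and margin values are common conventions assumed for the example, not details taken from the source.

```python
import math

def dcg(relevances):
    # Assumed convention: rel_i / log2(i + 1), with ranks starting at 1.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization within one group of sampled responses.
    # Assumption: population standard deviation plus a small eps for stability.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def info_nce(pos_sim, neg_sims, temperature=0.05):
    # InfoNCE for a single query: negative log-softmax of the positive
    # similarity among the positive and all negatives.
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

def triplet_margin_loss(d_pos, d_neg, margin=1.0):
    # Zero once the negative is farther than the positive by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions; q must be positive wherever p is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, `dcg([1, 0])` is larger than `dcg([0, 1])` because the relevant item is ranked first, and `triplet_margin_loss` returns zero once the negative is already at least `margin` farther away than the positive.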