GRPO: Group Relative Policy Optimization—a policy optimization algorithm that normalizes advantages within a group of sampled outputs for the same prompt, removing the need for a separate value function critic
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs
LLM-as-a-Judge: Using a strong LLM (such as GPT-4) to evaluate the quality or correctness of another model's output
Pass@k: An evaluation metric measuring the probability that at least one of k solutions sampled for a problem is correct
fastText: A library for efficient text classification and representation learning, used here to identify code-related documents
RAG: Retrieval-Augmented Generation—enhancing model responses by retrieving relevant documents from an external knowledge base
FSDP: Fully Sharded Data Parallel, a data-parallel training technique that reduces per-GPU memory use by sharding model parameters, gradients, and optimizer states across GPUs
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
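
Two of the entries above, GRPO and Pass@k, reduce to short formulas, so a minimal Python sketch may make them concrete. It is illustrative only: the function names are hypothetical, the use of the population standard deviation and the small eps term are assumptions, and the Pass@k function uses the commonly cited unbiased estimator 1 - C(n-c, k) / C(n, k), which may not be the exact variant used in the work these terms come from.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no learned
    value function is needed as a baseline. Using the population standard
    deviation and adding eps are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled solutions, of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no correct solution.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Four outputs for one prompt, scored 1 (correct) or 0 (incorrect).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Ten samples for one problem, three of them correct, evaluated at k = 1.
    print(pass_at_k(n=10, c=3, k=1))
```

In the sketch, outputs scoring above their group's mean receive positive advantages and those below receive negative ones, and when every output in a group has the same reward all advantages are zero.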