RAG: Retrieval-Augmented Generation—AI systems that answer questions by searching for external documents before generating a response
PPO: Proximal Policy Optimization—an RL algorithm that updates model policies in stable steps to maximize a reward function
CoT: Chain of Thought—prompting the model to generate intermediate reasoning steps (thinking) before the final answer
SFT: Supervised Fine-Tuning—training the model on labeled examples to learn a specific output format before RL optimization
KL divergence: A measure of how much one probability distribution differs from another; used in RL as a penalty that keeps the trained model from drifting too far from its original behavior
cold-start model: The initial model state (after SFT) used as the starting point for reinforcement learning; crucial for training stability
BGE: BAAI General Embedding—a specific pre-trained model used to convert text into vector representations for retrieval
GAE: Generalized Advantage Estimation—a method in RL to estimate how good an action is by balancing bias and variance
Exact Match (EM): A strict evaluation metric that counts a prediction as correct only if it matches the ground-truth string exactly, typically after light text normalization
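
To make the Exact Match entry concrete, here is a minimal Python sketch of how such a metric is commonly computed. The normalization steps (lowercasing, removing punctuation and English articles, collapsing whitespace) follow a widespread question-answering convention and are an assumption here, not something this glossary specifies. The function names `normalize_answer` and `exact_match` are illustrative.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: these steps mirror a common QA convention; the source
    does not say which normalization, if any, is applied.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truths: list[str]) -> bool:
    """Return True if the normalized prediction equals any normalized reference."""
    target = normalize_answer(prediction)
    return any(target == normalize_answer(gt) for gt in ground_truths)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)
```

Under this sketch a prediction such as "The Eiffel Tower." matches the reference "eiffel tower", while "Paris, France" does not match "Paris", which shows how strict the metric stays even with normalization.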