PPO: Proximal Policy Optimization, a policy-gradient reinforcement learning algorithm that keeps each update small and stable by clipping the probability ratio between the new and old policies
Cold Start: An initial supervised fine-tuning phase, run before RL, that uses synthetic data to teach the model the basic format of interleaving reasoning and retrieval
Process Reward: A reward signal given at intermediate steps (e.g., assessing document relevance) rather than just at the end
Outcome Reward: A reward signal based solely on the correctness of the final answer
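Dense Retriever: A retrieval system that encodes queries and documents as dense vector embeddings and ranks documents by embedding similarity
GAE: Generalized Advantage Estimation, an estimator of the advantage function in RL that takes an exponentially weighted sum of temporal-difference residuals, trading a little bias for lower variance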
SFT: Supervised Fine-Tuning, training a pretrained model on labeled input-output examples
KL divergence: A measure of difference between probability distributions, used here to prevent the RL model from deviating too far from the base model
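
To make the PPO and GAE entries above concrete, here is a minimal sketch in plain Python. It follows the standard textbook formulations, not necessarily the exact setup used in the work these terms come from. The function names, the default values for `gamma`, `lam`, and `eps`, and the convention that `values` carries one extra bootstrap entry are all assumptions made for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single trajectory.

    Assumption: `values` has len(rewards) + 1 entries, where the last
    entry is the bootstrap value of the state after the final step
    (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum from t + 1.
    for t in reversed(range(len(rewards))):
        # Temporal-difference residual at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals, decay gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    `ratio` is pi_new(a|s) / pi_old(a|s). Clipping removes the incentive
    to move the ratio outside [1 - eps, 1 + eps].
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With `lam=0` the estimator reduces to the one-step temporal-difference residual (lowest variance, most bias), and with `lam=1` it becomes the full discounted return minus the value baseline (least bias, highest variance). An outcome-only reward corresponds to a `rewards` list that is zero everywhere except the final step, while a process reward also fills in intermediate entries.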