MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
Process Supervision: Training models using feedback on intermediate reasoning steps rather than just the final result
Outcome Supervision: Training models based solely on whether the final answer is correct
DPO: Direct Preference Optimization—a method to align language models to preferences without training a separate reward model
SFT: Supervised Fine-Tuning—training the model on high-quality examples before applying RL
Rollout: Simulating the completion of a task from a certain state to estimate the future reward
Pruning: Removing unpromising branches in a search tree to save computation
Search-R1: A strong baseline method that applies outcome-supervised reinforcement learning to retrieval-augmented generation (RAG)
F1 score: A metric measuring the overlap between the predicted answer and the ground truth, computed as the harmonic mean of precision and recall
EM: Exact Match—a metric requiring the predicted answer to be identical to the ground truth
HotpotQA: A dataset for multi-hop question answering requiring reasoning over multiple documents
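
To make the EM and F1 score entries above concrete, here is a minimal Python sketch of how the two metrics are commonly computed for question answering. It assumes token-level overlap and a SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the glossary does not specify either, and the function names are illustrative rather than taken from any particular evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Assumed normalization: lowercase, drop punctuation and articles, tokenize."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower"))
    print(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"))
```

EM gives no credit for a partially correct answer, while F1 rewards shared tokens, which is why the two are usually reported together.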