RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions the language model's answer on them
Agentic RAG: A RAG system where the LLM autonomously decides when to search, what to query, and when to stop, rather than following a fixed pipeline
Process-supervised RL: Reinforcement learning that provides feedback at each intermediate step of reasoning, rather than just for the final result
MCTS: Monte Carlo Tree Search, a search algorithm that incrementally builds a decision tree and uses simulated future outcomes to estimate which moves are most promising
DPO: Direct Preference Optimization—a method that aligns language models to preferences by optimizing on paired examples (winner vs. loser) without a separate reward model
SPRE: Shortest Path Reward Estimation—a novel reward function in this paper that favors trajectories yielding correct answers in fewer steps
Outcome-supervised RL: RL where the model only receives a reward signal (positive/negative) after generating the complete final answer
Rollout: A simulation in MCTS in which the model continues generating from a given state until termination, so that the outcome can be used to estimate the value of that state
UCB: Upper Confidence Bound—a formula used in search algorithms to balance exploring new uncertain paths vs. exploiting known good paths
F1 score: The harmonic mean of precision and recall; in question answering it is typically computed over the tokens shared between the predicted answer and the ground truth answer
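
The two formula-based entries above (UCB and F1 score) can be made concrete with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `token_f1` and `ucb1` are hypothetical, tokenization is plain lowercase whitespace splitting, and the UCB form shown is the standard UCB1 rule with an assumed exploration constant `c`. The variant and constants actually used in the paper may differ.

```python
import math
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def ucb1(total_value: float, visits: int, parent_visits: int,
         c: float = math.sqrt(2)) -> float:
    """Standard UCB1 score for a child node in a search tree.

    The first term exploits the child's average value; the second term
    grows for rarely visited children and encourages exploration.
    """
    if visits == 0:
        # Unvisited children are tried before any visited one.
        return math.inf
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```

For example, `token_f1("the cat sat", "the cat")` has two overlapping tokens, giving precision 2/3 and recall 1, so the score is 0.8. In `ucb1`, a larger `c` shifts the balance toward exploring uncertain paths, while a smaller `c` favors paths that already look good.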