Agentic RAG: A RAG system where the LLM dynamically controls the retrieval process, rather than following a fixed retrieve-then-generate pipeline
Process-Supervised RL: Reinforcement learning that provides feedback at intermediate steps of a reasoning chain, rather than just the final outcome
Outcome-Supervised RL: Reinforcement learning where the model is rewarded only based on the correctness of the final answer
MCTS: Monte Carlo Tree Search—a search algorithm that balances exploration and exploitation to find optimal decision paths by building a search tree
SPRE: Shortest Path Reward Estimation—a proposed reward function that values correct answers and penalizes longer reasoning chains to encourage efficiency
DPO: Direct Preference Optimization—a method to align language models to preferences without training a separate reward model, using a specific loss function on preference pairs
RAG-ProGuide: The dataset introduced in this paper, containing 13,289 process-level preference pairs derived from MCTS exploration
Gradient Conflict: A phenomenon in outcome-supervised learning where a negative final reward penalizes correct intermediate steps, causing conflicting update signals
UCB: Upper Confidence Bound, a node-selection rule used in MCTS that balances a node's estimated value against the uncertainty of that estimate
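
To make the UCB entry concrete, here is a minimal sketch of the standard UCB1 rule as it is commonly used for MCTS node selection. The glossary does not state which UCB variant or exploration constant the paper uses, so the formula, the default constant `c`, and the `(name, total_value, visits)` tuple layout are all assumptions made for illustration.

```python
import math


def ucb1_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score (assumed variant, not confirmed by the source)."""
    if visits == 0:
        # Unvisited nodes get infinite priority so each child is tried once.
        return math.inf
    mean_value = total_value / visits  # exploitation: estimated value
    # Exploration bonus: grows when a node has few visits relative to its parent.
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + exploration


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the child with the highest UCB1 score.

    children: list of (name, total_value, visits) tuples (hypothetical layout).
    """
    best = max(children, key=lambda ch: ucb1_score(ch[1], ch[2], parent_visits, c))
    return best[0]
```

With `c = 0` the rule reduces to pure exploitation (highest mean value wins); larger `c` shifts selection toward rarely visited nodes.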