GPO: Guided Pivotal Optimization—the proposed fine-tuning strategy that focuses learning on critical steps
Advantage function: In RL, a measure of how much better a specific action is compared to the average action at that state; used here to find the most important reasoning step
Critical step: The specific step in a reasoning chain with the highest advantage value, indicating it is the pivotal moment for solving the problem
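The interplay between the advantage function and the critical step can be sketched in a few lines. This is a hypothetical illustration with made-up per-step values, not the paper's implementation: it just shows that the critical step is the argmax of the per-step advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t).

```python
# Hypothetical per-step estimates for a 4-step reasoning chain (assumed values).
q_values = [0.30, 0.55, 0.90, 0.92]  # Q(s_t, a_t): expected return of each step
v_values = [0.30, 0.35, 0.40, 0.85]  # V(s_t): average expected return at each state

# Advantage of each step: how much better this step is than average at its state.
advantages = [q - v for q, v in zip(q_values, v_values)]

# The critical step is the one with the highest advantage.
critical_step = max(range(len(advantages)), key=lambda t: advantages[t])
print(critical_step)  # here, step 2 (advantage 0.5) is the pivotal step
```

Here step 2 has advantage 0.5, far above the others, marking it as the pivotal moment for solving the problem.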
PPO: Proximal Policy Optimization—an online RL algorithm that updates policies while preventing drastic changes
DPO: Direct Preference Optimization—an offline method aligning models to preferences without an explicit reward model
Monte Carlo (MC) estimation: A method to estimate values (like Q-values) by averaging the results of many random simulations
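A minimal sketch of MC estimation of a Q-value, using a toy environment with assumed dynamics (reward 1 with probability 0.7, else 0) rather than anything from the method itself: the estimate is simply the mean return over many random rollouts.

```python
import random

def mc_q_estimate(simulate_return, n_rollouts=1000, seed=0):
    """Estimate a Q-value by averaging returns from many random rollouts."""
    rng = random.Random(seed)
    returns = [simulate_return(rng) for _ in range(n_rollouts)]
    return sum(returns) / n_rollouts

# Toy rollout from a fixed (state, action): succeeds with probability 0.7 (assumed).
estimate = mc_q_estimate(lambda rng: 1.0 if rng.random() < 0.7 else 0.0)
print(estimate)  # close to the true Q-value of 0.7
```

The estimate converges to the true expected return as the number of rollouts grows, at the cost of more simulation.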
Satori: A related method that uses random resets in reasoning chains; GPO improves on this by using targeted resets
Q-value: The expected total future reward of taking a specific action in a specific state
Concentrability: A theoretical measure of the mismatch between the optimal policy's state distribution and the current policy's distribution