Long-CoT: Long Chain-of-Thought—generating extremely detailed, step-by-step reasoning paths (sometimes thousands of tokens) to solve complex problems
Partial Rollout: A training technique where new trajectories are sampled by reusing large chunks of previous trajectories stored in a buffer, avoiding the cost of re-generating the full history
Online Mirror Descent: An optimization algorithm in which each update maximizes expected reward plus a regularization term (typically a KL penalty) that keeps the new policy close to a reference policy; used to stabilize RL training
SFT: Supervised Fine-Tuning—training a model on labeled examples before applying reinforcement learning
Reward Hacking: When an RL agent exploits loopholes in the reward function (e.g., guessing answers without reasoning) to maximize score without learning the intended task
Process Reward Model: A reward model that evaluates intermediate steps of reasoning rather than just the final outcome
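Monte Carlo Tree Search (MCTS): A search algorithm that builds a tree of candidate future states from repeated randomized simulations and uses their outcomes to decide which branches to explore further; common in RL, but replaced here by implicit search within a long context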
DPO: Direct Preference Optimization—a method to align models using preference pairs without an explicit reward model
Rejection Sampling: Generating multiple samples from a model and keeping only those that meet a correctness criterion
Value Function: A function estimating the expected future reward from a current state; explicitly removed in this paper's framework
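
Example (Rejection Sampling): a minimal sketch of the definition above, not the paper's implementation. The names rejection_sample, generate, is_correct, and toy_generate are hypothetical, and the "model" is a toy stand-in that proposes answers to 17 + 25 and is sometimes wrong.

```python
import random
from typing import Callable, List


def rejection_sample(
    generate: Callable[[], str],
    is_correct: Callable[[str], bool],
    num_samples: int,
) -> List[str]:
    """Draw num_samples candidates and keep only those that pass the check."""
    candidates = [generate() for _ in range(num_samples)]
    return [c for c in candidates if is_correct(c)]


def toy_generate(rng: random.Random) -> str:
    """Hypothetical stand-in for a model: answers "17 + 25", sometimes off by one."""
    return str(42 + rng.choice([-1, 0, 0, 1]))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
kept = rejection_sample(
    generate=lambda: toy_generate(rng),
    is_correct=lambda answer: answer == "42",  # the correctness criterion
    num_samples=8,
)
print(f"kept {len(kept)} of 8 samples")
```

In practice the kept samples are typically used as training data (for example for SFT), and the correctness check would be a verifier or reference-answer comparison rather than a hard-coded string.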