ZeroTIR: Zero-shot Tool-Integrated Reasoning—training a model to use tools via RL without supervised examples
ZeroRL: Reinforcement Learning applied directly to base models (without SFT) using simple outcome-based rewards
SFT: Supervised Fine-Tuning—training models on labeled examples of inputs and desired outputs
PPO: Proximal Policy Optimization—an RL algorithm that updates policies with a clipped objective to ensure stability
Reinforce++: A critic-free variant of the REINFORCE algorithm that borrows PPO-style stabilizers such as clipping and advantage normalization, aimed at stable RL training for LLM reasoning tasks
GAE: Generalized Advantage Estimation, a method that estimates an action's advantage as an exponentially weighted sum of temporal-difference errors, with a parameter λ that trades bias against variance
Outcome-based reward: A binary reward signal given only at the end of a task (1 for correct answer, 0 for incorrect), as opposed to step-by-step process rewards
Pass@k: The probability that at least one of k sampled solutions to a problem is correct
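
As an illustration of the Pass@k entry, the sketch below computes the unbiased estimator commonly used in code-generation evaluation: draw n samples per problem, count the c correct ones, and estimate Pass@k as 1 - C(n-c, k) / C(n, k). The source does not say which estimator it uses, so this choice is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled solutions, c of which are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the glossary itself does not specify one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # C(n - c, k) counts the size-k subsets containing no correct sample.
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 3 correct out of 10 samples, evaluated at k = 1 and k = 5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

For k = 1 the estimate reduces to the fraction of correct samples, c / n. Averaging the per-problem estimates over a benchmark gives the reported Pass@k.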