RLVR: Reinforcement Learning with Verifiable Rewards—training method where rewards are binary (correct/incorrect) based on a rule-based checker (e.g., for math or code).
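As a concrete illustration, a rule-based checker can be as simple as exact string match. This is a minimal sketch (the function name and matching rule are illustrative, not from the source); real verifiers typically normalize math expressions or execute generated code against unit tests.

```python
def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """Binary rule-based reward: 1.0 iff the model's extracted answer
    matches the reference after trimming whitespace, else 0.0.
    (Sketch only; production checkers do far more normalization.)"""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0
```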
R1-Zero-style training: A paradigm introduced by DeepSeek-R1-Zero that uses RLVR to induce reasoning capabilities (Chain of Thought) without supervised demonstrations.
CoT: Chain of Thought—intermediate reasoning steps generated by the model before the final answer.
GRPO: Group Relative Policy Optimization—a simplified PPO variant used in DeepSeek-R1-Zero that normalizes advantages within a group of samples drawn for the same prompt.
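The group normalization can be sketched as follows (the function name is hypothetical; note that some implementations use the sample rather than the population standard deviation):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantage: A_i = (r_i - mean) / std, computed over
    the G samples drawn for a single prompt (a minimal sketch)."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)  # population std dev
    if sigma == 0.0:  # all rewards equal: no learning signal for this group
        return [0.0] * len(group_rewards)
    return [(r - mu) / sigma for r in group_rewards]
```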
RLOO: REINFORCE Leave-One-Out—a variance-reduction technique for policy gradients in which each sample's baseline is the average reward of the other samples drawn for the same prompt.
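A minimal sketch of the leave-one-out baseline (names are illustrative); a useful property is that the resulting advantages always sum to zero within the group:

```python
def rloo_advantages(rewards):
    """Leave-one-out advantage: A_i = r_i - mean of the other rewards,
    i.e. A_i = r_i - (sum - r_i) / (n - 1). Requires n >= 2 samples."""
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```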
Rao-Blackwellization: A statistical technique to reduce the variance of an estimator by taking its expectation conditioned on a sufficient statistic (here, marginalizing out the final answer y).
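A toy numerical illustration of the variance reduction, on a problem unrelated to reasoning (all names and distributions here are assumptions made for the demo, not from the source):

```python
import random
import statistics

random.seed(0)

# Estimate E[X + Y] where X ~ Bernoulli(0.5) and Y | X ~ Normal(X, 1).
# The crude estimator averages raw samples of X + Y; the Rao-Blackwellized
# estimator replaces each sample with E[X + Y | X] = 2 * X, marginalizing
# out Y's noise while keeping the same expectation.
crude, rao_blackwell = [], []
for _ in range(20000):
    x = random.randint(0, 1)
    y = random.gauss(x, 1.0)
    crude.append(x + y)
    rao_blackwell.append(2.0 * x)  # conditional expectation given X

# Both target the same mean (1.0), but the conditioned samples have lower
# variance: Var(X + Y) = 0.25 + 1 = 1.25 versus Var(2X) = 1.0.
```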
JLB: Jensen's Lower Bound—a variational lower bound objective used in prior work like Tang et al. [40] for latent reasoning.
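Under the usual latent-variable reading (z the sampled reasoning trace, y the final answer; this notation is an assumption, not fixed by the glossary), the bound follows directly from Jensen's inequality applied to the log of an expectation:

```latex
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\!\left[ p_\theta(y \mid x, z) \right]
  \;\ge\; \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\!\left[ \log p_\theta(y \mid x, z) \right].
```

Maximizing the right-hand side over traces sampled from the model itself is the objective referred to as JLB.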
LaTRO: Latent Reasoning Optimization—another variational approach (Chen et al. [4]) using a fixed reference policy.
Policy Gradient: An optimization technique where the model's parameters are updated to increase the probability of actions that yield high rewards.
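A one-step sketch of the idea (hypothetical names; real implementations compute gradients via autodiff rather than passing them in by hand):

```python
def reinforce_update(params, grad_log_prob, reward, baseline=0.0, lr=0.01):
    """One REINFORCE-style step: move parameters in the direction
    (R - b) * grad log pi(a|s), which raises the probability of actions
    whose reward beats the baseline and lowers it otherwise."""
    scale = reward - baseline
    return [p + lr * scale * g for p, g in zip(params, grad_log_prob)]
```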
PPO: Proximal Policy Optimization—a standard RL algorithm that clips the policy update to prevent the policy from changing too drastically in one step.
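The clipped surrogate at the heart of PPO can be sketched as follows (names are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate (to be maximized) for one action:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r = pi_new / pi_old.
    Clipping removes any incentive to push the ratio beyond 1 +/- eps."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```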