GRPO: Group Relative Policy Optimization—an RL algorithm that scores each sampled output by the deviation of its reward from the average reward of its group of samples, removing the need for a critic network
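A minimal sketch of the group-relative scoring, assuming the common mean/standard-deviation normalization of rewards within a group (the function name and exact normalization are illustrative, not necessarily the canonical implementation):

```python
def group_relative_advantages(rewards):
    """Advantage of each sampled output relative to its group's average reward."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:  # all rewards equal: no learning signal, advantages are zero
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Four outputs for one prompt, two verified correct (reward 1), two wrong (reward 0):
print(group_relative_advantages([1, 0, 1, 0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because the baseline is the group's own mean, no separate value model is needed to center the gradient.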
U-statistic: A class of statistical estimators that generalizes the sample mean: a U-statistic averages a fixed function (the kernel) over all subsets of the sample of a given size
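As an illustration (not from the source): the sample mean is the U-statistic with the order-1 kernel h(x) = x, and the unbiased sample variance is the U-statistic with the order-2 kernel h(x, y) = (x − y)² / 2. A generic estimator might look like:

```python
from itertools import combinations

def u_statistic(sample, kernel, order):
    """Average the kernel over all size-`order` subsets of the sample."""
    subsets = list(combinations(sample, order))
    return sum(kernel(*s) for s in subsets) / len(subsets)

data = [1.0, 2.0, 4.0]
mean = u_statistic(data, lambda x: x, 1)                      # sample mean
var = u_statistic(data, lambda x, y: (x - y) ** 2 / 2, 2)     # unbiased variance
```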
Hoeffding decomposition: A statistical technique that breaks a U-statistic into orthogonal components (linear and higher-order), often used to prove asymptotic normality
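For a degree-2 U-statistic $U_n$ with kernel $h$ and target $\theta = \mathbb{E}[h(X_1, X_2)]$, the first-order form of the decomposition (stated here from general U-statistic theory, not quoted from the source) is:

```latex
U_n - \theta \;=\; \frac{2}{n}\sum_{i=1}^{n} h_1(X_i) \;+\; R_n,
\qquad h_1(x) = \mathbb{E}[h(x, X_2)] - \theta,
```

where the linear (Hájek projection) term drives asymptotic normality, $\sqrt{n}\,(U_n - \theta) \to \mathcal{N}\!\big(0,\, 4\operatorname{Var} h_1(X_1)\big)$, and the remainder $R_n$ is of lower order.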
Oracle policy gradient: A theoretical ideal algorithm that computes gradients using the true (unknown) value function as a baseline
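In symbols, the idealized estimator uses the true value function as the baseline (a standard form from policy-gradient theory, not quoted from the source), here written for the one-step prompt/response setting with prompt $x$ and sampled output $y$:

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{x,\; y \sim \pi_\theta}\!\left[
    \big( r(x, y) - V^{\pi_\theta}(x) \big)\,
    \nabla_\theta \log \pi_\theta(y \mid x)
\right],
\qquad V^{\pi_\theta}(x) = \mathbb{E}_{y \sim \pi_\theta}[\, r(x, y) \,]
```

The estimator is "oracle" because $V^{\pi_\theta}$ is unknown in practice; critics (PPO) and group averages (GRPO) are two ways of approximating it.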
Suboptimality gap: The difference in expected reward between the policy learned by the algorithm and the theoretically optimal policy
MSE: Mean Squared Error—a measure of the quality of an estimator (here, the gradient estimator)
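MSE decomposes into squared bias plus variance; a small sketch of this identity applied to a set of estimates (illustrative names, not from the source):

```python
def mse_decomposition(estimates, true_value):
    """Return (mse, bias_squared, variance); mse == bias_squared + variance."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias_sq = (mean - true_value) ** 2
    variance = sum((e - mean) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mse, bias_sq, variance
```

For estimates [2.0, 4.0] of a true value 1.0, this gives MSE 5.0 = bias² 4.0 + variance 1.0, which is why both the bias and the variance of a gradient estimator matter.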
PPO: Proximal Policy Optimization—a standard RL algorithm that typically uses a learned critic network to reduce gradient variance
RLVR: Reinforcement Learning with Verifiable Rewards—a post-training method where rewards are objective (e.g., math solution is correct) rather than learned from human preference
Critic network: In Actor-Critic RL, a neural network that estimates the value (expected future reward) of a state to guide the actor's updates