MAB: Multi-Armed Bandit—a problem where an agent must choose between multiple options (arms) to maximize reward, balancing exploration and exploitation
TD(0): Temporal Difference learning with a one-step lookahead (the λ=0 case of TD(λ))—an update rule that moves the value estimate toward the immediate reward plus the discounted estimate of the next state
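As a minimal sketch of the TD(0) update just described (function and variable names are ours, not from the source):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[s] toward r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]  # bootstrapped target minus current estimate
    V[s] += alpha * td_error
    return V

V = {"a": 0.0, "b": 1.0}
td0_update(V, "a", r=0.5, s_next="b")  # V["a"] becomes ~0.149
```

Note that the target bootstraps off the current estimate `V[s_next]` rather than a full return, which is what distinguishes TD(0) from Monte Carlo updates.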
absolute advantage: The absolute value of the advantage function |A_t|; used here as a proxy for learning gain because it scales the gradient norm in policy gradient methods
GRPO: Group Relative Policy Optimization—an RL algorithm used for reasoning tasks that normalizes rewards within a group of outputs
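The within-group normalization can be sketched as follows (a simplified illustration, not the full GRPO objective; whether the population or sample standard deviation is used varies by implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward relative to its group: (r - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]

# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Each output's advantage is thus defined only by how it compares to the other outputs in its group, with no learned value function.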
OOD: Out-of-Distribution—test data that differs significantly from training data (e.g., harder difficulty levels)
RLOO: REINFORCE Leave-One-Out—a policy gradient estimator that, for each sampled output, uses the mean reward of the other samples in the group as a baseline to reduce variance
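The leave-one-out baseline amounts to a one-line computation (a sketch under our own naming; only the advantage term is shown, not the full REINFORCE gradient):

```python
def rloo_advantages(rewards):
    """For each sample, subtract the mean reward of the *other* samples."""
    n = len(rewards)
    total = sum(rewards)
    # (total - r) / (n - 1) is the leave-one-out mean for sample with reward r
    return [r - (total - r) / (n - 1) for r in rewards]

advs = rloo_advantages([1.0, 0.0, 0.0, 1.0])  # each correct sample gets +2/3, each incorrect -2/3
```

Because each sample's baseline excludes its own reward, the baseline is independent of that sample and the estimator stays unbiased.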
PPO: Proximal Policy Optimization—an RL algorithm that updates policies with a clipped objective to ensure stability
POMDP: Partially Observable Markov Decision Process—a framework for decision making where the system state is not fully visible
pass@1: The probability that a single model generation is correct
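In practice pass@1 is estimated empirically as the fraction of single generations that are correct, as in this sketch (names are ours):

```python
def pass_at_1(results):
    """Empirical pass@1: fraction of generations graded correct."""
    return sum(results) / len(results)

rate = pass_at_1([True, False, True, True])  # 3 of 4 correct -> 0.75
```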