GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance, but risks gradient vanishing if intra-group variance is low.
VAS: Variance-Aware Sampling—the proposed strategy to sample prompts that maximize expected reward variance.
VPS: Variance Promotion Score—a metric combining outcome variance and trajectory diversity to guide sampling.
OVS: Outcome Variance Score—component of VPS measuring the variance of correctness (Bernoulli variance), maximized when pass rate is 0.5.
TDS: Trajectory Diversity Score—component of VPS measuring diversity of reasoning paths (e.g., inverse self-BLEU), providing a lower bound on variance.
CoT: Chain-of-Thought—a reasoning method where the model generates intermediate steps before the final answer.
Self-BLEU: A diversity metric that scores each generated sequence with BLEU against the other sequences in the same set and averages the results; lower Self-BLEU implies higher diversity.
Pass rate: The fraction of generated responses for a given prompt that are correct.
Gradient vanishing: In this context, the phenomenon where policy gradient updates approach zero because the advantage (reward minus the group baseline) is zero for every output when all rewards in a group are identical.
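
To make the GRPO, OVS, and gradient-vanishing entries concrete, here is a minimal sketch, not the paper's implementation. It assumes binary correctness rewards (1 = correct, 0 = incorrect) and the commonly used normalization (reward minus group mean, divided by group standard deviation). The function names `group_advantages` and `outcome_variance` and the `eps` stabilizer are hypothetical, introduced only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps).

    Assumed formulation; `eps` is an illustrative guard against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def outcome_variance(rewards):
    """Bernoulli variance p * (1 - p), where p is the group's pass rate."""
    p = mean(rewards)
    return p * (1 - p)


# All outputs correct: pass rate 1.0, zero variance, every advantage is zero,
# so this prompt contributes no policy gradient signal.
uniform_group = [1, 1, 1, 1]
print(outcome_variance(uniform_group), group_advantages(uniform_group))

# Half correct: pass rate 0.5, where the Bernoulli variance peaks at 0.25,
# and the advantages are clearly nonzero.
mixed_group = [1, 0, 1, 0]
print(outcome_variance(mixed_group), group_advantages(mixed_group))
```

Under these assumptions, a prompt whose sampled outputs are all correct or all incorrect yields zero advantages, while a prompt with a pass rate near 0.5 yields the largest outcome variance. Prompts of the second kind are the ones a variance-aware sampler would favor.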