RLVR: Reinforcement Learning with Verifiable Rewards—RL using objective correctness signals (e.g., math answers) rather than human feedback
GRPO: Group Relative Policy Optimization—a prevalent RLVR method that samples a group of outputs per input and normalizes rewards within that group to reduce variance
MSSR: Multimodal Stabilized Single-Rollout—the proposed method using one rollout per input plus entropy shaping for stability
MVSR: Multimodal Vanilla Single-Rollout—a baseline single-rollout method without entropy shaping, used to demonstrate instability
entropy collapse: A failure mode where a policy becomes overly confident too quickly, losing diversity (randomness) and getting stuck in suboptimal behaviors
advantage shaping: Modifying the calculated advantage (learning signal) by adding auxiliary terms (like entropy) to guide optimization
Beta distribution: A probability distribution defined on the interval [0, 1], used here to estimate the expected probability that a rollout receives the correct-answer reward
KL divergence: Kullback-Leibler divergence—an asymmetric measure (not a true distance metric) of how much one probability distribution differs from another, used here to track policy change
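
Three of the quantitative terms above (GRPO's within-group normalization, the mean of a Beta distribution, and KL divergence) can be illustrated with a minimal Python sketch. The function names, the uniform Beta(1, 1) prior, and the `eps` constant are assumptions made for illustration only; the glossary does not specify how the method itself parameterizes or uses these quantities.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: standardize rewards within one group of rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps (an assumed constant) avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]


def beta_mean(successes, failures, prior_a=1.0, prior_b=1.0):
    """Mean of Beta(prior_a + successes, prior_b + failures).

    The uniform Beta(1, 1) prior is an assumption, not taken from the source.
    """
    a = prior_a + successes
    b = prior_b + failures
    return a / (a + b)


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, `group_relative_advantages([1, 0, 1, 0])` gives values close to `[1, -1, 1, -1]`, and `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`, which is why KL divergence is not a symmetric distance.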