HERO: Hybrid Ensemble Reward Optimization—the proposed framework combining rule-based and model-based rewards
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt to reduce variance
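The group-wise advantage normalization that GRPO performs can be sketched as follows; this is a minimal illustration (function name and the `1e-8` stabilizer are my choices, not from the source):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize rewards within one group of sampled outputs for the
    same prompt: subtract the group mean and divide by the group std,
    which reduces variance across prompts of differing difficulty."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # eps guards all-equal groups

# Two correct and two incorrect samples for one prompt:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is computed from the group itself, correct samples get positive advantages and incorrect ones negative, with no separate value network.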
RLVR: Reinforcement Learning with Verifiable Rewards—using deterministic checkers (like code execution or string match) as reward signals
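A string-match checker of the kind RLVR uses can be sketched in a few lines; the whitespace/case normalization shown here is an illustrative assumption, not the paper's exact rule:

```python
def verifiable_reward(response: str, gold: str) -> float:
    """Deterministic verifier: reward 1.0 on exact match after light
    normalization, else 0.0. The signal is binary and sparse."""
    norm = lambda s: s.strip().lower()
    return 1.0 if norm(response) == norm(gold) else 0.0
```

Note that such verifiers are brittle to surface form, which is the source of the false negatives defined below.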
Reward Model: A trained neural network that predicts a scalar quality score for a response, providing denser feedback than sparse binary verifiers
Stratified Normalization: A technique to rescale continuous reward scores into disjoint ranges (e.g., [0, α] and [β, 1]) based on a binary verifier's output
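Under the definition above, a reward-model score in [0, 1] can be rescaled into one of two disjoint bands keyed on the verifier bit. A minimal sketch (the default values of `alpha` and `beta` are placeholders, not values from the source):

```python
def stratified_normalize(score: float, verified: bool,
                         alpha: float = 0.4, beta: float = 0.6) -> float:
    """Map a reward-model score in [0, 1] into disjoint ranges:
    verifier-rejected responses land in [0, alpha],
    verifier-accepted responses land in [beta, 1]."""
    if verified:
        return beta + (1.0 - beta) * score  # [beta, 1]
    return alpha * score                     # [0, alpha]
```

With alpha < beta, every verified response outranks every unverified one, while the reward model still orders responses within each band.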
Variance-Aware Weighting: A mechanism to assign higher training weights to prompts where the model generates diverse-quality responses (high variance in scores)
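One simple way to realize this weighting is to make each prompt's weight proportional to the standard deviation of its sampled-response scores; this sketch is an assumption about the mechanism's shape, not the paper's exact formula:

```python
import numpy as np

def variance_weights(score_groups, eps=1e-8):
    """Weight each prompt by the std of its group's reward scores,
    normalized across prompts. Prompts where all samples score the
    same (nothing to learn from) get near-zero weight."""
    stds = np.array([np.std(g) for g in score_groups])
    return stds / (stds.sum() + eps)

# Prompt 0: uniform scores; prompt 1: diverse-quality responses.
w = variance_weights([[1.0, 1.0, 1.0], [1.0, 0.0, 0.5]])
```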
False Negative: When a correct answer is rejected by the verifier (e.g., due to formatting) and thus incorrectly receives a reward of 0