Evaluation Setup
Policy optimization on reasoning tasks with binary outcome rewards (correct/incorrect)
Benchmarks:
- Reasoning tasks: mathematical/logical reasoning (implied by the DeepSeek-R1 context)
Metrics:
- Gradient Variance
- Training Stability
- Statistical methodology: Not explicitly reported in the paper
Main Takeaways
- BNPO reduces the variance of policy gradient estimates compared to static normalization methods.
- The method theoretically generalizes REINFORCE (recovered at α=β=1) and GRPO (recovered at α=β=1.5) under binary reward settings.
- An advantage decomposition mechanism allows the method to extend to complex reward systems (e.g., format + accuracy rewards).
- Dynamic adjustment of normalization parameters aligns better with the evolving policy distribution during training.
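The takeaways above can be illustrated with a minimal sketch. The normalization form below is inferred from the stated special cases (α=β=1 recovers REINFORCE's baseline-subtracted reward; α=β=1.5 recovers GRPO's division by sqrt(p(1-p))); the function names and the additive form of the advantage decomposition are hypothetical, not the paper's exact estimator.

```python
import math

def bnpo_advantages(rewards, alpha, beta, eps=1e-6):
    """Normalize binary rewards with a Beta(alpha, beta)-shaped factor.

    Assumed normalizer: p**(alpha-1) * (1-p)**(beta-1), where p is the
    empirical success rate of the group. Special cases implied by the notes:
      alpha = beta = 1.0  -> r - p            (REINFORCE with a mean baseline)
      alpha = beta = 1.5  -> (r - p) / sqrt(p * (1 - p))   (GRPO-style)
    """
    p = sum(rewards) / len(rewards)  # empirical success rate of the group
    norm = (p + eps) ** (alpha - 1) * (1 - p + eps) ** (beta - 1)
    return [(r - p) / norm for r in rewards]

def decomposed_advantages(component_rewards, params):
    """Hypothetical advantage decomposition for composite reward systems:
    normalize each binary component (e.g. format, accuracy) separately
    with its own Beta parameters, then sum the per-sample advantages."""
    n = len(next(iter(component_rewards.values())))
    total = [0.0] * n
    for name, rs in component_rewards.items():
        a, b = params[name]
        for i, adv in enumerate(bnpo_advantages(rs, a, b)):
            total[i] += adv
    return total

# One prompt group of four rollouts with binary outcomes.
rewards = [1, 0, 1, 1]
reinforce_adv = bnpo_advantages(rewards, 1.0, 1.0)  # alpha=beta=1   -> r - p
grpo_adv      = bnpo_advantages(rewards, 1.5, 1.5)  # alpha=beta=1.5 -> GRPO-style

# Composite rewards (format + accuracy), each normalized independently.
components = {"accuracy": [1, 0, 1, 1], "format": [1, 1, 0, 1]}
total_adv = decomposed_advantages(
    components, {"accuracy": (1.5, 1.5), "format": (1.5, 1.5)}
)
```

In this sketch, dynamically adjusting α and β amounts to re-estimating them each step from the current policy's success rate, which is how the adaptive normalization would track the evolving reward distribution.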