
BNPO: Beta Normalization Policy Optimization

Changyi Xiao, Mengdi Zhang, Yixin Cao
School of Computer Science, Fudan University
arXiv.org (2025)
RL Reasoning

📝 Paper Summary

Reinforcement Learning for LLMs · Policy Optimization
BNPO dynamically normalizes binary rewards using an adaptive Beta distribution that evolves alongside the policy, reducing gradient variance more effectively than static methods.
Core Problem
Current policy optimization methods like REINFORCE and GRPO use either no normalization or static normalization strategies that fail to adapt to the changing distribution of rewards as the policy updates.
Why it matters:
  • Fixed normalization cannot track the dynamic nature of policy updates, leading to unstable gradient estimates.
  • High variance in gradient estimation hinders training stability and convergence, particularly in reasoning tasks with sparse binary rewards.
  • Methods like DeepSeek-R1 rely on binary rule-based rewards, making efficient optimization in this setting critical for reasoning capabilities.
Concrete Example: In a reasoning task where a model initially gets 10% correct, a fixed normalization might be appropriate. However, as the model improves to 90% accuracy, the distribution of expected rewards shifts drastically. Static methods like GRPO use batch-level statistics that may fluctuate wildly or fail to capture the global shift, whereas BNPO adjusts its Beta distribution parameters to match the evolving probability of success.
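The failure mode described above can be seen in a few lines of code. This is an illustrative sketch (the function name `grpo_advantage` and the ε constant are our own, not from the paper): with sparse binary rewards, GRPO-style batch normalization degenerates whenever a small batch is all failures or all successes, because the batch standard deviation collapses to zero and the advantage signal vanishes.

```python
import numpy as np

def grpo_advantage(rewards, eps=1e-8):
    """GRPO-style normalization: subtract the batch mean reward and
    divide by the batch standard deviation. Statistics are purely
    batch-local, so they fluctuate with each sampled batch."""
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)

# Early in training the policy succeeds rarely; a small batch is
# often all zeros, and the normalized advantages are all zero too,
# i.e. the batch contributes no learning signal at all.
all_fail = np.zeros(8)
print(grpo_advantage(all_fail))  # every advantage is 0.0

# A batch with one success produces large, batch-dependent scales.
mixed = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(grpo_advantage(mixed))
```

An adaptive normalizer that tracks the policy's success probability across batches, as BNPO proposes, avoids tying the baseline and scale to a single noisy batch.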
Key Novelty
Beta-Distribution Adaptive Reward Normalization
  • Models the expected binary reward (success probability) as a Beta distribution that updates dynamically during training.
  • Derives an optimal parameter setting for this Beta distribution that theoretically minimizes policy gradient variance.
  • Generalizes existing methods: REINFORCE and GRPO are shown to be special cases of BNPO with fixed Beta parameters.
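To make the idea concrete, here is a minimal sketch of Beta-distribution adaptive reward normalization. It is an assumption-laden illustration, not the paper's exact update rule: we use the standard conjugate Beta update for binary outcomes and normalize rewards against the Beta mean and standard deviation, so the baseline and scale evolve with the policy's success rate.

```python
import numpy as np

class BetaNormalizer:
    """Illustrative adaptive normalizer (not the paper's exact rule).

    Maintains a Beta(alpha, beta) belief over the policy's success
    probability and normalizes binary rewards against that belief,
    so the baseline tracks the policy as it improves.
    """

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta

    def update(self, rewards):
        # Conjugate Beta update: successes raise alpha, failures raise beta.
        s = rewards.sum()
        self.alpha += s
        self.beta += len(rewards) - s

    def normalize(self, rewards, eps=1e-8):
        a, b = self.alpha, self.beta
        mean = a / (a + b)  # Beta mean
        std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # Beta std
        return (rewards - mean) / (std + eps)

norm = BetaNormalizer()
# Early training: mostly failures shift the belief toward low success.
norm.update(np.array([1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
print(norm.normalize(np.array([1.0, 0.0])))  # success rewarded, failure penalized
```

Note how the recovered special cases fit this picture: with fixed Beta parameters the normalizer reduces to a static baseline and scale, which is how REINFORCE and GRPO arise as special cases of BNPO.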
Evaluation Highlights
  • Claims state-of-the-art performance among policy optimization methods on reasoning tasks; specific metric gains are not tabulated in the available text.
  • Theoretically proven to minimize gradient variance when parameters α and β are set to specific values derived from method-of-moments estimation.
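For reference, the textbook method-of-moments estimator for a Beta distribution matches the sample mean and variance to the Beta moments; how BNPO derives its variance-minimizing α and β follows the paper's own derivation, so this generic sketch is only an assumption about the estimation machinery involved.

```python
import numpy as np

def beta_method_of_moments(samples):
    """Textbook method-of-moments fit for Beta(alpha, beta).

    Matches sample moments to the Beta moments:
      mean = a / (a + b)
      var  = a * b / ((a + b)**2 * (a + b + 1))
    Requires 0 < var < mean * (1 - mean); raw binary samples hit the
    boundary (var ~= mean*(1-mean)), so in practice the fit is applied
    to estimated success probabilities, not raw 0/1 rewards.
    """
    m, v = samples.mean(), samples.var()
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Recover parameters from draws of a known Beta(2, 5).
rng = np.random.default_rng(1)
a, b = beta_method_of_moments(rng.beta(2.0, 5.0, size=200_000))
print(a, b)  # close to (2, 5)
```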
Breakthrough Assessment
7/10
Strong theoretical grounding connecting RL to Beta distributions with a generalization of GRPO/REINFORCE. Practical impact depends on the magnitude of empirical gains, which are claimed but not detailed numerically in the provided snippet.