
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Tue Le, Linh Ngo Van, Trung Le
arXiv (2025)
RL · Reasoning · Agent

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Large Language Model Reasoning · Generalization in RL
GRPO-SG stabilizes reasoning model training by reweighting token updates based on their generation probability, reducing gradient sharpness and improving generalization without requiring a separate value network.
Core Problem
Standard Group Relative Policy Optimization (GRPO) weights all tokens equally, producing 'sharp' updates (large gradient norms) on unstable, low-probability tokens that degrade generalization.
Why it matters:
  • RL training for reasoning is notoriously unstable; uncontrolled gradients can cause policy collapse or overfitting to specific logic paths
  • Reasoning models need to generalize to new math/logic problems, but standard RLVR often overfits to the training set's specific verification rules
  • Existing solutions like PPO require expensive value models; GRPO is efficient but lacks mechanisms to explicitly control update sharpness/flatness
Concrete Example: In a logic puzzle, a model might guess the correct answer via a low-probability token sequence. Standard GRPO would aggressively reinforce this 'lucky' spike, causing a massive gradient update (high sharpness). GRPO-SG detects the low probability and downweights this update, preventing the model from overfitting to the noisy signal.
Key Novelty
Sharpness-Guided GRPO (GRPO-SG)
  • Theoretically links RLVR generalization error to 'sharpness' (measured by gradient norm), proposing that flatter minima generalize better
  • Introduces a token-level weight $w_{i,t}$ derived from the model's own output probability to regulate update magnitude
  • Downweights tokens that would cause exploding gradients (reducing sharpness) while preserving signal for confident, semantically critical tokens
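The mechanism above can be sketched in a few lines. Note the summary does not give the exact functional form of $w_{i,t}$, so the probability-based weight below (with a hypothetical `floor` parameter) is an illustrative assumption, not the paper's implementation:

```python
import math

def sharpness_guided_weight(logprob, floor=0.1):
    """Hypothetical token-level weight w_{i,t} derived from the model's
    own output probability p = exp(logprob). The gradient of log p
    scales like 1/p, so low-probability tokens produce 'sharp'
    (large-norm) updates; scaling the update by p caps that magnitude.
    The exact form used by GRPO-SG may differ -- this is a sketch."""
    p = math.exp(logprob)
    # Keep a small floor so rare-but-correct tokens still get signal.
    return max(p, floor)

def weighted_token_update(logprob, advantage):
    """Per-token policy-gradient contribution with the weight applied.
    Standard GRPO corresponds to w_{i,t} = 1 for every token."""
    return sharpness_guided_weight(logprob) * advantage * logprob
```

Under this sketch, a confident token (p ≈ 0.9) keeps nearly full weight, while a 'lucky' low-probability token (p ≈ 0.01) is clamped to the floor, shrinking its contribution roughly ninefold relative to the confident one.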
Evaluation Highlights
  • +61.5% relative improvement in average accuracy on K&K Logic Puzzles (0.39 -> 0.63) for Qwen2.5-3B compared to standard GRPO
  • Nearly doubles Exact Match accuracy on agentic QA tasks (13.84 -> 27.29) for Qwen2.5-3B
  • +14.4% absolute gain on AIME 2024 (avg@16) for DeepScaleR (28.89 -> 43.33) using GRPO-SG
Breakthrough Assessment
7/10
A theoretically grounded yet simple modification to a popular algorithm (GRPO). Strong empirical gains across diverse reasoning tasks (Math, Logic, Agentic) suggest high practical utility, though the core mechanism is a weighting heuristic.