
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Tue Le, Linh Ngo Van, Trung Le
arXiv (2025)
RL · Reasoning · Agent

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Large Language Model Reasoning · Generalization in RL
GRPO-SG stabilizes reasoning model training by reweighting token updates based on their generation probability, reducing gradient sharpness and improving generalization without requiring a separate value network.
Core Problem
Standard Group Relative Policy Optimization (GRPO) weights all tokens equally, producing 'sharp' updates (large gradient norms) on unstable, low-probability tokens that degrade generalization.
Why it matters:
  • RL training for reasoning is notoriously unstable; uncontrolled gradients can cause policy collapse or overfitting to specific logic paths
  • Reasoning models need to generalize to new math/logic problems, but standard RLVR often overfits to the training set's specific verification rules
  • Existing solutions like PPO require expensive value models; GRPO is efficient but lacks mechanisms to explicitly control update sharpness/flatness
Concrete Example: In a logic puzzle, a model might guess the correct answer via a low-probability token sequence. Standard GRPO would aggressively reinforce this 'lucky' spike, causing a massive gradient update (high sharpness). GRPO-SG detects the low probability and downweights this update, preventing the model from overfitting to the noisy signal.
Key Novelty
Sharpness-Guided GRPO (GRPO-SG)
  • Theoretically links RLVR generalization error to 'sharpness' (measured by gradient norm), proposing that flatter minima generalize better
  • Introduces a token-level weight $w_{i,t}$ derived from the model's own output probability to regulate update magnitude
  • Downweights tokens that would cause exploding gradients (reducing sharpness) while preserving signal for confident, semantically critical tokens
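The mechanism above can be sketched in a few lines. Note the summary does not give the exact functional form of $w_{i,t}$, so the probability-based weight below (with a hypothetical `floor` parameter) is an illustrative assumption, not the paper's implementation:

```python
import math

def sharpness_guided_weight(logprob, floor=0.1):
    """Hypothetical token-level weight w_{i,t} derived from the model's
    own output probability p = exp(logprob). The gradient of log p
    scales like 1/p, so low-probability tokens produce 'sharp'
    (large-norm) updates; scaling the update by p caps that magnitude.
    The exact form used by GRPO-SG may differ -- this is a sketch."""
    p = math.exp(logprob)
    # Keep a small floor so rare-but-correct tokens still get signal.
    return max(p, floor)

def weighted_token_update(logprob, advantage):
    """Per-token policy-gradient contribution with the weight applied.
    Standard GRPO corresponds to w_{i,t} = 1 for every token."""
    return sharpness_guided_weight(logprob) * advantage * logprob
```

Under this sketch, a confident token (p ≈ 0.9) keeps nearly full weight, while a 'lucky' low-probability token (p ≈ 0.01) is clamped to the floor, shrinking its contribution roughly ninefold relative to the confident one.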
Evaluation Highlights
  • +61.5% relative improvement in average accuracy on K&K Logic Puzzles (0.39 -> 0.63) for Qwen2.5-3B compared to standard GRPO
  • Nearly doubles Exact Match accuracy on agentic QA tasks (13.84 -> 27.29) for Qwen2.5-3B
  • +14.4% absolute gain on AIME 2024 (avg@16) for DeepScaleR (28.89 -> 43.33) using GRPO-SG
Breakthrough Assessment
7/10
A theoretically grounded yet simple modification to a popular algorithm (GRPO). Strong empirical gains across diverse reasoning tasks (Math, Logic, Agentic) suggest high practical utility, though the core mechanism is a weighting heuristic.