Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes

📝 Paper Summary

Reinforcement Learning for Reasoning, Model Calibration, AI for Science

Group Relative Policy Optimization (GRPO) inherently causes language models to become overconfident in stochastic domains due to biased advantage normalization, whereas PPO and RLOO remain calibrated.

Core Problem

Reinforcement learning methods like GRPO excel in deterministic domains (e.g., math) but fail in stochastic settings (e.g., scientific experiments), inducing extreme overconfidence in predicted probabilities.

Why it matters:

Scientific reasoning requires models to accurately estimate uncertainty and probabilities, not just provide binary answers
Standard RL reasoning methods effectively break model calibration, making them unreliable for high-stakes decision-making or hypothesis generation
Current trends in 'reasoning' models assume verifiable deterministic ground truth, leaving a gap for probabilistic real-world tasks

Concrete Example: In a CRISPR experiment where a gene perturbation has a 70% chance of an effect, a GRPO-trained model is driven to predict near 100% or 0%, while PPO correctly converges to the 70% probability.

Key Novelty

Bias Identification in GRPO Normalization

Identifies that the standard normalization term in GRPO's advantage estimator (dividing by group standard deviation) creates a policy-dependent bias
Demonstrates that this bias creates a feedback loop: as the policy concentrates, the normalization term amplifies the reward signal for overconfident predictions
Proposes removing group standard normalization from GRPO to restore unbiasedness and achieve calibration comparable to PPO and RLOO
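
The amplification effect in the feedback loop above can be illustrated with a small sketch. It assumes log-likelihood rewards, a population standard deviation, and two hypothetical groups of four predictions; the specific values are made up for illustration and are not taken from the paper. As the predictions in a group concentrate, the group standard deviation shrinks, so dividing by it scales the same kind of reward difference by a much larger factor.

```python
import math
import statistics


def log_likelihood_reward(p_hat, outcome):
    """Log-likelihood of a binary outcome under predicted probability p_hat."""
    return math.log(p_hat) if outcome == 1 else math.log(1.0 - p_hat)


def std_amplification(predictions, outcome, eps=1e-8):
    """Factor by which standardization rescales mean-centered rewards.

    Uses the population standard deviation (an assumption; an
    implementation may use the sample standard deviation instead).
    """
    rewards = [log_likelihood_reward(p, outcome) for p in predictions]
    return 1.0 / (statistics.pstdev(rewards) + eps)


# Hypothetical groups (G = 4): a spread-out policy vs. a concentrated one.
wide_group = [0.50, 0.60, 0.80, 0.90]
narrow_group = [0.88, 0.89, 0.90, 0.91]

amp_wide = std_amplification(wide_group, outcome=1)
amp_narrow = std_amplification(narrow_group, outcome=1)

print(f"amplification, wide group:   {amp_wide:.1f}")
print(f"amplification, narrow group: {amp_narrow:.1f}")
```

In this toy setup the concentrated group receives a scaling factor more than ten times larger than the spread-out group, even though its predictions differ by only a few percentage points.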

Architecture

Analysis of bias in GRPO advantage estimates compared to True Advantage and No-Standardization GRPO

Evaluation Highlights

Removing standard-deviation normalization from GRPO reduces Expected Calibration Error (ECE) from 0.292 (standard GRPO) to 0.036 on real-world CRISPR tasks using Qwen3-4B
Standard GRPO yields an AUROC of 0.69 on CRISPR data, significantly worse than PPO (0.72) and RLOO (0.72)
On synthetic data, standard GRPO produces extreme overconfidence (ECE 0.239) while PPO and RLOO remain nearly perfectly calibrated (ECE < 0.005)

Breakthrough Assessment

7/10

Provides a crucial diagnostic and fix for a popular algorithm (GRPO) in a new domain (stochastic reasoning). Theoretical analysis is sound and experiments are clear, though scope is limited to calibration.

⚙️ Technical Details

Problem Definition

Setting: Probability prediction task: Given prompt q and binary answer a ~ Bernoulli(p), predict probability p that a=1

Inputs: Natural language question q (e.g., about experimental setup)

Outputs: Predicted probability scalar (tokenized percentage 1-99)

Pipeline Flow

Input Prompt (Question)
Language Model Policy (Qwen3-4B)
Sampling (Generate G responses/probabilities)
Reward Calculation (Log-likelihood against ground truth)
Advantage Estimation (GRPO/PPO/RLOO)
Policy Update

System Modules

Policy Model

Generate probability predictions given a prompt

Model or implementation: Qwen3-4B

Advantage Estimator

Calculate the relative quality of each generated response to guide optimization

Model or implementation: Mathematical Function (GRPO/PPO/RLOO)

Novel Architectural Elements

Modification of GRPO to remove group standard normalization (using the mean-centered reward directly instead of dividing it by the group standard deviation) to fix calibration issues

Modeling

Base Model: Qwen3-4B

Training Method: Policy Gradient (PPO, GRPO, RLOO)

Objective Functions:

Purpose: Optimize model to predict accurate probabilities.

Formally: Reward r(o,a) = a*log(p) + (1-a)*log(1-p) (log-likelihood), where p here denotes the probability predicted in response o

Purpose: GRPO Advantage Estimation (Standard).

Formally: A_i = (r_i - mean(r)) / (std(r) + epsilon), with mean and std taken over the G rewards in the group

Purpose: GRPO Advantage Estimation (Proposed/No Std).

Formally: A_i = r_i - mean(r)
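
The reward and the two advantage estimators above can be sketched in a few lines. The example below is a toy illustration, not the paper's analysis: it assumes a hypothetical group of only two predictions (one overconfident at 0.9, one calibrated at 0.7), a true probability of 0.7, and a population standard deviation. It computes the exact expectation, over the random outcome, of the advantage assigned to the overconfident prediction.

```python
import math


def log_likelihood_reward(p_hat, a):
    """r = a*log(p_hat) + (1-a)*log(1-p_hat) for a binary outcome a."""
    return a * math.log(p_hat) + (1 - a) * math.log(1.0 - p_hat)


def grpo_advantages(rewards, standardize=True, eps=1e-8):
    """Mean-center rewards within a group; optionally divide by the group std."""
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not standardize:
        return centered
    # Population standard deviation (an assumption for this sketch).
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]


def expected_advantage(predictions, index, p_true, standardize):
    """Exact expectation over a ~ Bernoulli(p_true) of one sample's advantage."""
    total = 0.0
    for a, prob in ((1, p_true), (0, 1.0 - p_true)):
        rewards = [log_likelihood_reward(q, a) for q in predictions]
        total += prob * grpo_advantages(rewards, standardize)[index]
    return total


group = [0.9, 0.7]  # hypothetical: overconfident vs. calibrated prediction
p_true = 0.7

adv_standard = expected_advantage(group, 0, p_true, standardize=True)
adv_no_std = expected_advantage(group, 0, p_true, standardize=False)

print(f"standard GRPO, expected advantage of 0.9: {adv_standard:+.3f}")
print(f"no-std GRPO,   expected advantage of 0.9: {adv_no_std:+.3f}")
```

Under these assumptions the standardized estimator gives the overconfident prediction an expected advantage of about +0.4, because standardization reduces each outcome to a win or a loss of equal size and the overconfident prediction wins 70% of the time. Without standardization the large penalty on the remaining 30% of outcomes dominates, and the expected advantage is negative.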

Training Data:

Synthetic: 10,000 question-answer pairs with known probabilities
CRISPR: Perturb-seq dataset from Replogle et al. (2022), balanced positive/negative instances

Key Hyperparameters:

learning_rate: 1e-6
train_batch_size: 512
mini_batch_size: 64
epochs: 16
kl_coefficient: 0.001
group_size: 4
max_response_length: 2048
ppo_clip_epsilon: Not explicitly listed for main table but 0.2 tested in ablation

Compute: Not reported in the paper

Comparison to Prior Work

vs. PPO: GRPO avoids training a separate critic model but standard normalization introduces bias in stochastic settings
vs. RLOO: GRPO (No Std) is proportional to RLOO (by a factor of (G-1)/G), because it subtracts the mean over all G samples rather than the mean over the other G-1
vs. Dr. GRPO: This paper specifically analyzes standard normalization's effect on stochastic probability calibration, whereas Dr. GRPO focuses on general reasoning performance and length bias
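
The proportionality between GRPO (No Std) and RLOO noted above follows from r_i - mean_all(r) = ((G-1)/G) * (r_i - mean_others(r)). A minimal numerical check, using hypothetical reward values and function names chosen for this sketch:

```python
def grpo_no_std_advantages(rewards):
    """Subtract the mean over all G rewards in the group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def rloo_advantages(rewards):
    """Subtract the mean over the other G-1 rewards (leave-one-out baseline)."""
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]


rewards = [-0.11, -0.36, -0.69, -0.22]  # hypothetical log-likelihood rewards, G = 4
g = len(rewards)

grpo_adv = grpo_no_std_advantages(rewards)
rloo_adv = rloo_advantages(rewards)
scale = (g - 1) / g  # constant factor relating the two estimators

for a_grpo, a_rloo in zip(grpo_adv, rloo_adv):
    print(f"{a_grpo:+.4f}  vs  {scale:.2f} * {a_rloo:+.4f}")
```

Because the factor is a constant that does not depend on the policy, it only rescales the gradient and does not change its direction.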

Limitations

Evaluation limited to binary probability prediction tasks
Theoretical analysis assumes group size G is large, though experiments use G=4
Does not explore continuous outcome spaces
Reliance on log-likelihood reward (though Brier score is briefly analyzed in appendix)

Reproducibility

Code: https://github.com/mbereket/uncalibrated_reasoning

Code is publicly available. Data processing steps for CRISPR are detailed in Appendix A.5. Qwen3-4B is an open model.

📊 Experiments & Results

Evaluation Setup

Predicting probabilities of binary outcomes. Task 1: Synthetic data with known ground truth. Task 2: Predicting gene perturbation effects (CRISPR).

Benchmarks:

Synthetic Probability Task (Controlled probability prediction) [New]
Replogle et al. (2022) CRISPR Screen (Scientific outcome prediction)

Metrics:

ECE (Expected Calibration Error)
AUROC (Area Under the Receiver Operating Characteristic curve)
Accuracy (thresholded at 0.5)
Statistical methodology: 95% confidence intervals visualized in Figure 2 error bars
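
For reference, ECE can be computed roughly as follows. This is a sketch under stated assumptions: it uses 10 equal-width bins and weights each bin's gap between mean predicted probability and observed frequency by its share of samples. The summary does not state which binning scheme the paper uses, so the bin count and function name here are illustrative.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE for binary outcomes with equal-width probability bins."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_pred - frequency)
    return ece


# Hypothetical data: an event that occurs 7 times out of 10.
outcomes = [1] * 7 + [0] * 3
calibrated_ece = expected_calibration_error([0.70] * 10, outcomes)
overconfident_ece = expected_calibration_error([0.99] * 10, outcomes)

print(f"calibrated predictions:    ECE = {calibrated_ece:.3f}")
print(f"overconfident predictions: ECE = {overconfident_ece:.3f}")
```

In this made-up example, always predicting 0.70 gives an ECE of essentially zero, while always predicting 0.99 gives an ECE of about 0.29.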

Key Results

Benchmark	Metric	Baseline (standard GRPO)	This Paper (GRPO without std normalization)	Δ
Synthetic experiments demonstrate that standard GRPO fails to calibrate, while unnormalized variants succeed.
Synthetic Data	ECE (Lower is better)	0.239	0.002	-0.237
Synthetic Data	AUROC	0.75	0.82	+0.07
Real-world biological experiments confirm the synthetic findings: GRPO induces overconfidence.
CRISPR Screen	ECE (Lower is better)	0.292	0.036	-0.256
CRISPR Screen	AUROC	0.69	0.72	+0.03

Experiment Figures

Reliability diagrams (Calibration plots) for Synthetic data.

Real-world CRISPR experiment results (ECE and Reliability plots).

Main Takeaways

Removing group standard normalization from GRPO eliminates the overconfidence bias, recovering calibration performance matching PPO and RLOO.
The clipped policy gradient mechanism (from PPO) does NOT cause the miscalibration; the issue is isolated to the advantage normalization term.
Accuracy (thresholded at 0.5) is largely unaffected by the choice of algorithm, but standard GRPO severely degrades calibration (ECE) and also lowers AUROC.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Language Model alignment (RLHF)
Calibration metrics (ECE)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of samples generated from the same prompt

PPO: Proximal Policy Optimization—an RL algorithm using a clipped objective and explicit value function to ensure stable policy updates

RLOO: REINFORCE Leave-One-Out—an RL algorithm that uses the mean reward of other samples in a batch as a baseline for variance reduction

ECE: Expected Calibration Error—a metric measuring the difference between predicted probabilities and actual outcome frequencies (lower is better)

CRISPR screen: A high-throughput biological experiment method used here as a stochastic testbed; models predict if a gene perturbation affects a cell phenotype

Standard Normalization: The process of subtracting the mean and dividing by the standard deviation; in GRPO, this is applied to the rewards within a group

Proper Scoring Rule: A reward function (like log-likelihood) whose expected value is maximized when the predicted probability matches the true probability
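
A short derivation shows why the log-likelihood reward is a proper scoring rule. The symbol \hat{p} for the predicted probability is introduced here for clarity and is not the notation used above, where p stands for both the true and the predicted probability.

```latex
\begin{align}
\mathbb{E}_{a \sim \mathrm{Bernoulli}(p)}\,[r]
  &= p \log \hat{p} + (1 - p) \log (1 - \hat{p}) \\
\frac{d}{d\hat{p}}\, \mathbb{E}[r]
  &= \frac{p}{\hat{p}} - \frac{1 - p}{1 - \hat{p}} = 0
  \iff \hat{p} = p \\
\frac{d^2}{d\hat{p}^2}\, \mathbb{E}[r]
  &= -\frac{p}{\hat{p}^2} - \frac{1 - p}{(1 - \hat{p})^2} < 0
\end{align}
```

The second derivative is negative, so \hat{p} = p is the unique maximum. For the 70% example, predicting 0.70 yields an expected reward of about -0.611, whereas predicting 0.99 yields about -1.389, so an unbiased policy gradient should favor the calibrated prediction.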