Evaluation Setup
Predicting probabilities of binary outcomes across two tasks. Task 1: synthetic data with known ground-truth probabilities. Task 2: predicting gene perturbation effects in a CRISPR screen.
Benchmarks:
- Synthetic Probability Task (Controlled probability prediction) [New]
- Replogle et al. (2022) CRISPR Screen (Scientific outcome prediction)
Metrics:
- ECE (Expected Calibration Error)
- AUROC (Area Under the Receiver Operating Characteristic curve)
- Accuracy (thresholded at 0.5)
- Statistical methodology: 95% confidence intervals, shown as error bars in Figure 2
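A minimal sketch of the binned ECE metric listed above, assuming the standard equal-width formulation (mean |accuracy − confidence| per bin, weighted by bin size); the function name and bin count are illustrative, not from the paper:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: |empirical accuracy - mean confidence| per bin,
    weighted by the fraction of samples falling in that bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins, with the first bin closed on the left
        mask = (probs > lo) & (probs <= hi) if lo > 0 else (probs >= lo) & (probs <= hi)
        if not mask.any():
            continue
        conf = probs[mask].mean()   # mean predicted probability in the bin
        acc = labels[mask].mean()   # empirical frequency of the positive class
        ece += mask.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated predictor scores 0; a model that always says 0.9 but is right only half the time scores 0.4.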
Key Results
Synthetic experiments demonstrate that standard GRPO fails to calibrate, while unnormalized variants succeed:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Synthetic Data | ECE (lower is better) | 0.239 | 0.002 | -0.237 |
| Synthetic Data | AUROC | 0.75 | 0.82 | +0.07 |

Real-world biological experiments confirm the synthetic findings: GRPO induces overconfidence.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CRISPR Screen | ECE (lower is better) | 0.292 | 0.036 | -0.256 |
| CRISPR Screen | AUROC | 0.69 | 0.72 | +0.03 |
Main Takeaways
- Removing the group standard-deviation normalization from GRPO eliminates the overconfidence bias, recovering calibration performance that matches PPO and RLOO.
- The clipped policy gradient mechanism (from PPO) does NOT cause the miscalibration; the issue is isolated to the advantage normalization term.
- Accuracy (thresholded at 0.5) is largely unaffected by the choice of algorithm, but probabilistic reliability (ECE/AUROC) is severely degraded by standard GRPO.