
Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards

Ali Rad, Khashayar Filom, Darioush Keivan, Peyman Mohajerin Esfahani, Ehsan Kamalinejad
Cognichip.ai
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Noisy Supervision in RL
This paper analytically proves that in group-normalized reinforcement learning, verifier noise primarily slows down convergence ('rate') rather than preventing it ('fate'), provided the verifier's Youden Index remains positive.
Core Problem
Verifiers (unit tests, LLM judges) used in Reinforcement Learning with Verifiable Rewards (RLVR) are inherently noisy, suffering from False Positives and False Negatives.
Why it matters:
  • Imperfect tests in coding domains (sparse unit tests) can uncouple rewards from functional correctness, potentially leading to model collapse
  • It is unknown whether verification noise simply slows learning or actively reverses it, causing the model to optimize for incorrect behaviors
  • Existing methods relying on AI feedback (RLAIF) or self-rewards are vulnerable to systematic bias and reward hacking
Concrete Example: In coding tasks, a solution might pass a weak test suite but be functionally incorrect (False Positive). Conversely, a correct solution might fail a flaky test (False Negative). If the False Positive Rate exceeds the True Positive Rate, the RL algorithm might actively learn to produce buggy code that satisfies the weak tests.
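The sign condition in this example is exactly the paper's Youden's Index criterion. A minimal sketch (the function name and printed diagnostics are ours, not the paper's):

```python
def youden_index(tpr: float, fpr: float) -> float:
    """Youden's Index J = TPR - FPR, the quantity the paper identifies
    as the 'coefficient of friction' of noisy RLVR.

    J > 0: learning converges, only more slowly as J shrinks
    J = 0: the verifier is uninformative; learning is neutral
    J < 0: rewards are anti-correlated with correctness; learning collapses
    """
    return tpr - fpr

# The weak test suite from the example: it rewards 60% of buggy
# solutions (FPR = 0.6) but only 50% of correct ones (TPR = 0.5),
# so J < 0 and RL would actively learn to produce buggy code.
print(round(youden_index(0.5, 0.6), 3))  # → -0.1
```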
Key Novelty
Multi-Armed Bandit View of RLVR Dynamics via Youden's Index
  • Models the learning dynamics of Group Relative Policy Optimization (GRPO) as a replicator process (natural selection) on the probability simplex
  • Identifies Youden's Index (J = TPR - FPR) as the singular 'coefficient of friction' that determines the direction and speed of learning
  • Demonstrates that noise simply rescales the time variable: a noisy environment requires roughly 1/J times as many steps as a clean one to reach the same accuracy
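The replicator/time-rescaling picture above can be sketched on a two-arm toy problem (our simplification, not the paper's full GRPO derivation): under verifier noise the expected reward gap between the correct and incorrect arm shrinks from 1 to J = TPR − FPR, so the drift of the replicator ODE is multiplied by J and the clean trajectory is simply traversed at speed J.

```python
def replicator_step(p: float, j: float, dt: float = 0.01) -> float:
    """One Euler step of the toy replicator ODE dp/dt = J * p * (1 - p),
    where p is the probability of sampling the correct arm.
    Clean verifier: J = 1. Noise only rescales the drift by J."""
    return p + dt * j * p * (1.0 - p)

def evolve(p0: float, j: float, steps: int, dt: float = 0.01) -> float:
    p = p0
    for _ in range(steps):
        p = replicator_step(p, j, dt)
    return p

# Time rescaling: a noisy run with J = 0.5 takes 1/J = 2x the steps
# of a clean run to land at (numerically) the same point.
clean = evolve(0.1, 1.0, 1000)
noisy = evolve(0.1, 0.5, 2000)  # twice the steps, same endpoint
print(clean, noisy)  # agree up to discretization error
```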
Evaluation Highlights
  • Identified a sharp phase transition at Youden's Index J=0: learning succeeds strictly when J > 0, is neutral at J=0, and collapses when J < 0
  • Derived the exact time-rescaling law: noisy dynamics with index J converge to the same solution as noise-free dynamics but scaled by a factor of 1/J
  • Proved that noise-free GRPO error decays asymptotically at a rate of t^-2, while noisy regimes follow the same trajectory slowed by the noise level
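The last two highlights combine into a single statement (notation is ours, hedged, not copied from the paper): if the noisy policy is the clean policy run on a rescaled clock, and the clean error decays as $t^{-2}$, then

$$\pi^{\text{noisy}}_t = \pi^{\text{clean}}_{Jt}, \qquad 1 - \pi^{\text{clean}}_t(a^\star) = \Theta\!\left(t^{-2}\right) \;\Longrightarrow\; 1 - \pi^{\text{noisy}}_t(a^\star) = \Theta\!\left((Jt)^{-2}\right),$$

i.e. noise changes the rate constant by $1/J^2$ in the error but never the $t^{-2}$ exponent, which is the 'rate, not fate' claim in formula form.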
Breakthrough Assessment
8/10
Provides a fundamental theoretical framework solving the stability question for noisy RLVR. The 'Rate vs. Fate' distinction and the J=0 phase transition offer a crisp analytical lens for future RL research.