Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

Yuxuan Zhu, Daniel Kang
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Data Quality in Post-training
The paper refutes the hypothesis that RLVR is robust to reward noise by demonstrating that prior findings relied on contaminated datasets and that truly noisy data significantly degrades reasoning performance.
Core Problem
Recent studies incorrectly claim that LLMs can learn effective reasoning from 100% incorrect data, leading to the dangerous assumption that data quality is secondary to algorithmic design.
Why it matters:
  • Misleads the field into underinvesting in high-quality verifiable data curation
  • Encourages reliance on flawed 'robust' algorithms that fail in real-world noisy scenarios
  • Obscures the true failure modes of RLVR, which collapses to simple format adherence under severe noise
Concrete Example: In prior datasets, a math problem might have a ground truth '1/2'. If the model outputs '0.5', a weak verifier marks it 'incorrect' (noise). However, since '0.5' is actually correct, the model learns correct reasoning despite the 'incorrect' label. The authors show this 'contamination' inflated prior robustness claims.
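The false-negative mechanism above can be sketched in a few lines. This is an illustrative toy, not the paper's verifier: `exact_match_verifier` and `normalized_verifier` are hypothetical names, and real math verifiers handle far more answer formats.

```python
from fractions import Fraction

def exact_match_verifier(prediction: str, ground_truth: str) -> bool:
    # Weak verifier: pure string comparison, so '0.5' != '1/2'.
    return prediction.strip() == ground_truth.strip()

def normalized_verifier(prediction: str, ground_truth: str) -> bool:
    # Stronger verifier: parse both answers as exact rationals and
    # compare values, so equivalent forms are accepted.
    try:
        return Fraction(prediction.strip()) == Fraction(ground_truth.strip())
    except ValueError:
        return False

# '0.5' equals '1/2', but the weak verifier labels it incorrect --
# a false negative that masquerades as label noise in the dataset.
print(exact_match_verifier("0.5", "1/2"))  # False (false negative)
print(normalized_verifier("0.5", "1/2"))   # True
```

Under the weak verifier, the "incorrect" label still sits on top of a genuinely correct reasoning trace, which is why training on such "noise" can look robust.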
Key Novelty
Empirical invalidation of the 'Noise Robustness Hypothesis' in RLVR
  • Identifies that 'noisy' datasets in prior work contained >16% correct answers (false negatives), creating a false signal of robustness
  • Constructs a rigorously verified 'truly noisy' dataset using GPT-5 Pro and symbolic verification to test actual noise tolerance
  • Demonstrates that under true noise, RLVR performance collapses to that of models trained only to follow output formats (e.g., boxing answers)
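The "collapse to format adherence" finding can be made concrete with a toy reward comparison. This is a minimal sketch under assumed conventions (a `\boxed{...}` answer format, as the summary mentions); the function names are hypothetical and this is not the paper's reward implementation.

```python
import re

def format_only_reward(completion: str) -> float:
    # Rewards mere format adherence: any content inside \boxed{...}
    # earns full reward, regardless of correctness. Under truly noisy
    # labels, RLVR effectively learns no more than this signal.
    return 1.0 if re.search(r"\\boxed\{[^}]+\}", completion) else 0.0

def verifiable_reward(completion: str, answer: str) -> float:
    # Correctness-based reward: the boxed content must match the
    # ground-truth answer to earn reward.
    m = re.search(r"\\boxed\{([^}]+)\}", completion)
    return 1.0 if m and m.group(1).strip() == answer else 0.0

wrong = r"The result is \boxed{7}."
print(format_only_reward(wrong))      # 1.0 -- format satisfied
print(verifiable_reward(wrong, "3"))  # 0.0 -- answer is wrong
```

The gap between these two signals is exactly what the authors measure: under 100% incorrect labels, the trained model performs no better than one optimized against the format-only reward.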
Evaluation Highlights
  • Training on truly 100% incorrect annotations degrades MATH-500 accuracy by 9% compared to training on clean data, contradicting prior claims of <5% loss
  • Real-world annotation errors in the BIRD Text2SQL dataset reduce accuracy by 5–12% compared to a manually corrected clean version
  • State-of-the-art noise mitigation algorithms (adaptive clipping, dynamic sampling) fail to recover performance, lagging behind standard GRPO on clean data by over 3%
Breakthrough Assessment
8/10
Crucial correction to the field's understanding of RLVR. By debunking the 'noise is fine' myth with rigorous data analysis, it redirects focus back to the necessity of high-quality data.