Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

Yuxuan Zhu, Daniel Kang
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Data Quality in Post-training
The paper refutes the hypothesis that RLVR is robust to reward noise by demonstrating that prior findings relied on contaminated datasets and that truly noisy data significantly degrades reasoning performance.
Core Problem
Recent studies incorrectly claim that LLMs can learn effective reasoning from 100% incorrect data, leading to the dangerous assumption that data quality is secondary to algorithmic design.
Why it matters:
  • Misleads the field into underinvesting in high-quality verifiable data curation
  • Encourages reliance on flawed 'robust' algorithms that fail in real-world noisy scenarios
  • Obscures the true failure modes of RLVR, which collapses to simple format adherence under severe noise
Concrete Example: In prior datasets, a math problem might have a ground truth '1/2'. If the model outputs '0.5', a weak verifier marks it 'incorrect' (noise). However, since '0.5' is actually correct, the model learns correct reasoning despite the 'incorrect' label. The authors show this 'contamination' inflated prior robustness claims.
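The false-negative mechanism above can be sketched in a few lines. This is an illustrative toy, not the paper's verifier: `exact_match_verifier` and `normalized_verifier` are hypothetical names, and real math verifiers handle far more answer formats.

```python
from fractions import Fraction

def exact_match_verifier(prediction: str, ground_truth: str) -> bool:
    # Weak verifier: pure string comparison, so '0.5' != '1/2'.
    return prediction.strip() == ground_truth.strip()

def normalized_verifier(prediction: str, ground_truth: str) -> bool:
    # Stronger verifier: parse both answers as exact rationals and
    # compare values, so equivalent forms are accepted.
    try:
        return Fraction(prediction.strip()) == Fraction(ground_truth.strip())
    except ValueError:
        return False

# '0.5' equals '1/2', but the weak verifier labels it incorrect --
# a false negative that masquerades as label noise in the dataset.
print(exact_match_verifier("0.5", "1/2"))  # False (false negative)
print(normalized_verifier("0.5", "1/2"))   # True
```

Under the weak verifier, the "incorrect" label still sits on top of a genuinely correct reasoning trace, which is why training on such "noise" can look robust.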
Key Novelty
Empirical invalidation of the 'Noise Robustness Hypothesis' in RLVR
  • Identifies that 'noisy' datasets in prior work contained >16% correct answers (false negatives), creating a false signal of robustness
  • Constructs a rigorously verified 'truly noisy' dataset using GPT-5 Pro and symbolic verification to test actual noise tolerance
  • Demonstrates that under true noise, RLVR performance collapses to that of models trained only to follow output formats (e.g., boxing answers)
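The "collapse to format adherence" finding can be made concrete with a toy reward comparison. This is a minimal sketch under assumed conventions (a `\boxed{...}` answer format, as the summary mentions); the function names are hypothetical and this is not the paper's reward implementation.

```python
import re

def format_only_reward(completion: str) -> float:
    # Rewards mere format adherence: any content inside \boxed{...}
    # earns full reward, regardless of correctness. Under truly noisy
    # labels, RLVR effectively learns no more than this signal.
    return 1.0 if re.search(r"\\boxed\{[^}]+\}", completion) else 0.0

def verifiable_reward(completion: str, answer: str) -> float:
    # Correctness-based reward: the boxed content must match the
    # ground-truth answer to earn reward.
    m = re.search(r"\\boxed\{([^}]+)\}", completion)
    return 1.0 if m and m.group(1).strip() == answer else 0.0

wrong = r"The result is \boxed{7}."
print(format_only_reward(wrong))      # 1.0 -- format satisfied
print(verifiable_reward(wrong, "3"))  # 0.0 -- answer is wrong
```

The gap between these two signals is exactly what the authors measure: under 100% incorrect labels, the trained model performs no better than one optimized against the format-only reward.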
Evaluation Highlights
  • Training on truly 100% incorrect annotations degrades MATH-500 accuracy by 9% compared to training on clean data, contradicting prior claims of <5% loss
  • Real-world annotation errors in the BIRD Text2SQL dataset reduce accuracy by 5–12% compared to a manually corrected clean version
  • State-of-the-art noise mitigation algorithms (adaptive clipping, dynamic sampling) fail to recover performance, lagging behind standard GRPO on clean data by over 3%
Breakthrough Assessment
8/10
Crucial correction to the field's understanding of RLVR. By debunking the 'noise is fine' myth with rigorous data analysis, it redirects focus back to the necessity of high-quality data.