Spurious Rewards: Rethinking Training Signals in RLVR

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Mathematical Reasoning · Training Dynamics Analysis

RLVR can significantly improve math reasoning on Qwen models even with random or incorrect rewards by amplifying pre-existing high-quality behaviors (like code reasoning) via GRPO's clipping bias.

Core Problem

The community assumes RLVR improves reasoning through verifiable feedback, but the underlying mechanism is poorly understood, as evidenced by gains from noisy or limited supervision.

Why it matters:

Current RLVR research relies heavily on Qwen models, which may exhibit unique behaviors that do not generalize to other model families like Llama or OLMo
Understanding whether RL teaches new skills or merely amplifies existing ones is crucial for designing robust post-training pipelines
Blindly applying RLVR methods validated on Qwen to other models may result in failure or performance degradation

Concrete Example: When trained with completely random rewards (noise), Qwen2.5-Math-7B improves its MATH-500 accuracy by 21.4 percentage points (absolute), approaching the 29.1-point gain from ground truth rewards. In contrast, Llama3.1-8B-Instruct loses 6.4 points under the same random reward conditions.

Key Novelty

Spurious Reward Elicitation & Clipping Bias Analysis

Demonstrates that 'spurious rewards' (random, incorrect, or format-only) can elicit strong performance gains in specific models (Qwen), challenging the assumption that accurate feedback is necessary for RLVR
Identifies 'clipping bias' in the GRPO objective as the mechanism: the clipping term asymmetrically favors high-probability tokens from the base model, reinforcing them even without informative rewards
Pinpoints 'code reasoning' (using Python to solve math) as the specific latent high-quality behavior in Qwen models that gets amplified by this bias
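The reward variants named in the points above (ground truth, incorrect label, format-only, random) can be sketched as small deterministic or Bernoulli functions. This is a minimal illustration, not the paper's implementation: the function names, the `\boxed{...}` answer convention, and the simple regex are assumptions made here; only the default `gamma = 0.5` comes from the hyperparameters listed in this summary.

```python
import random
import re

# Assumption: final answers are written as \boxed{...}. This simple pattern
# does not handle nested braces such as \boxed{\frac{1}{2}}.
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def extract_answer(response: str):
    """Return the content of the last boxed expression, or None if absent."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response: str, correct_answer: str) -> float:
    """1.0 if the extracted answer equals the verified answer, else 0.0."""
    return 1.0 if extract_answer(response) == correct_answer else 0.0


def incorrect_label_reward(response: str, wrong_label: str) -> float:
    """Same check as above, but the label is deliberately a wrong answer."""
    return 1.0 if extract_answer(response) == wrong_label else 0.0


def format_reward(response: str) -> float:
    """1.0 whenever any boxed answer is present, regardless of correctness."""
    return 1.0 if extract_answer(response) is not None else 0.0


def random_reward(rng: random.Random, gamma: float = 0.5) -> float:
    """Bernoulli(gamma) reward that never looks at the response."""
    return 1.0 if rng.random() < gamma else 0.0


response = "Adding the two terms gives \\boxed{4}"
print(ground_truth_reward(response, "4"))     # 1.0
print(incorrect_label_reward(response, "5"))  # 0.0
print(format_reward(response))                # 1.0
print(random_reward(random.Random(0)))        # 0.0 or 1.0, independent of the response
```

Only the first function carries task-relevant signal; the other three are "spurious" in the sense used here because their value is either unrelated to correctness or tied to a wrong target.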

Architecture

Ablation of the clipping term in GRPO, demonstrating that clipping is the cause of learning from random rewards.
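One way to build intuition for why clipping treats tokens unequally is to translate the ratio constraint into probability space: if ρ = π_new / π_old must stay in [1-ε, 1+ε], then π_new must stay in [π_old(1-ε), π_old(1+ε)]. The sketch below only evaluates that interval for a few example probabilities chosen here for illustration; it is not the paper's ablation code or its derivation, and the helper names are hypothetical.

```python
def clip_interval(p_old: float, eps: float = 0.2):
    """Range of new-policy probabilities whose ratio stays in [1-eps, 1+eps]."""
    return p_old * (1.0 - eps), p_old * (1.0 + eps)


def upper_clip_reachable(p_old: float, eps: float = 0.2) -> bool:
    """The upper clip can only trigger if p_old * (1+eps) is a valid probability."""
    return p_old * (1.0 + eps) < 1.0


for p_old in (0.01, 0.30, 0.85):
    lower, upper = clip_interval(p_old)
    headroom = upper - p_old  # absolute increase allowed before clipping
    print(
        f"p_old={p_old:.2f}  interval=[{lower:.3f}, {upper:.3f}]  "
        f"headroom={headroom:.3f}  upper clip reachable={upper_clip_reachable(p_old)}"
    )
```

Two things follow directly from the arithmetic: the absolute headroom ε·π_old grows with π_old, and for any token with π_old above 1/(1+ε) (about 0.83 when ε = 0.2) the upper bound exceeds 1, so an increase in its probability can never be clipped. Low-probability tokens, by contrast, hit the upper clip after a very small absolute change.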

Evaluation Highlights

+21.4% absolute accuracy gain on MATH-500 for Qwen2.5-Math-7B using purely random rewards
+24.1% absolute accuracy gain on MATH-500 for Qwen2.5-Math-7B using rewards based on incorrect labels
Code reasoning frequency in Qwen2.5-Math-7B increases from 65.0% to ~90% under spurious rewards, strongly correlating with accuracy improvements

Breakthrough Assessment

9/10

A highly counterintuitive and critical finding that challenges the foundations of RLVR. By showing that random rewards work on popular benchmarks, it forces a re-evaluation of prior success stories in reasoning alignment.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning fine-tuning of Language Models on reasoning tasks using binary rewards

Inputs: Math problems (prompts) x

Outputs: Generated reasoning chain and answer y

Pipeline Flow

Policy Model (generates multiple rollouts per prompt)
Reward Function (assigns binary score based on Ground Truth, Format, or Spurious criteria)
GRPO Update (computes advantages and updates policy with clipping)
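The three stages above can be sketched end to end for a single prompt. This is a toy illustration under stated assumptions: `sample_rollouts` is a hypothetical stand-in for sampling from the policy model, `binary_reward` is a placeholder scorer, and the use of the population standard deviation with a small tolerance is one common convention for group normalization rather than a detail confirmed by this summary.

```python
import random
import statistics


def sample_rollouts(prompt: str, group_size: int, rng: random.Random):
    """Hypothetical stand-in for the policy: returns fake responses."""
    return [f"{prompt} -> answer {rng.randint(0, 3)}" for _ in range(group_size)]


def binary_reward(response: str, target: int) -> float:
    """Placeholder reward: 1.0 if the fake response ends with the target."""
    return 1.0 if response.endswith(f"answer {target}") else 0.0


def group_normalized_advantages(rewards, tol: float = 1e-6):
    """Center and scale rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < tol:
        # All rollouts scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


rng = random.Random(0)
rollouts = sample_rollouts("2 + 1 = ?", group_size=8, rng=rng)
rewards = [binary_reward(r, target=3) for r in rollouts]
advantages = group_normalized_advantages(rewards)
for rollout, reward, adv in zip(rollouts, rewards, advantages):
    print(f"{rollout!r}  reward={reward}  advantage={adv:+.2f}")
```

The advantages then feed the clipped policy update. Because they are centered within each group, they sum to zero, so a reward that is unrelated to the response still produces positive and negative advantages on individual rollouts.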

System Modules

Policy Model

Generate reasoning steps and answers for math problems

Model or implementation: Qwen2.5-Math, Qwen2.5, Llama3, OLMo2 variants

Reward Function

Provide feedback signal to the policy

Model or implementation: Deterministic functions

Modeling

Base Model: Qwen2.5-Math-7B (primary), plus Qwen2.5-1.5B/7B, Llama3.1-8B-Instruct, Llama3.2-3B-Instruct, OLMo2-7B

Training Method: GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize policy to maximize reward while limiting deviation from the old policy.

Formally: J(θ) = E[min(ρ * A, clip(ρ, 1-ε, 1+ε) * A)], where ρ is the importance ratio and A is the group-normalized advantage.
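The per-token term inside the expectation can be evaluated by hand to see when clipping is active. The following is a minimal sketch in plain Python with hypothetical function names; it computes the objective value only and leaves out everything else a training loop would need (log-probabilities, gradients, batching).

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1-eps, 1+eps) * A) for a single token."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


def objective(ratios, advantages, eps: float = 0.2) -> float:
    """Empirical mean of the per-token terms, standing in for E[...]."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# Four cases: (ratio, advantage)
cases = [
    (1.5, 1.0),   # A > 0, ratio above 1+eps: capped at (1+eps) * A
    (0.5, 1.0),   # A > 0, ratio below 1-eps: unclipped term is smaller, kept
    (0.5, -1.0),  # A < 0, ratio below 1-eps: capped at (1-eps) * A
    (1.5, -1.0),  # A < 0, ratio above 1+eps: unclipped term is smaller, kept
]
for ratio, adv in cases:
    print(ratio, adv, clipped_surrogate(ratio, adv))
```

The cases show that the clip only flattens the objective when the ratio has already moved in the direction the advantage favors; moves in the opposite direction remain fully penalized.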

Training Data:

DeepScaleR dataset (math questions)

Key Hyperparameters:

clipping_threshold_epsilon: 0.2 (implied by text example)
random_reward_probability_gamma: 0.5 (default for random experiments)
group_size: 64 (for majority vote generation)
training_steps: 300

Compute: Not reported in the paper

Limitations

Spurious rewards only improve models with high-quality latent priors (like Qwen); they degrade or do not help other models (Llama, OLMo).
Training with random rewards converges more slowly than training with ground truth rewards.
The approach is an analytical tool to understand RLVR, not a recommended practical method for training new capabilities.
Results on AIME 2025 (post-cutoff) show ground truth rewards are still superior, though spurious rewards provide some gains.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with pass@1 or average@k evaluation

Benchmarks:

MATH-500 (Competition-level math problems)
AMC (American Mathematics Competitions)
AIME 2024 / 2025 (American Invitational Mathematics Examination)

Metrics:

Pass@1 Accuracy
Average@8 Accuracy (for AMC)
Statistical methodology: Not explicitly reported in the paper
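The two accuracy metrics listed above can be sketched as follows. The summary does not spell out the evaluation protocol, so this assumes the usual reading: pass@1 scores one sampled response per problem, and average@k takes the mean correctness over k sampled responses per problem before averaging across problems. The function names are illustrative.

```python
def average_at_k(correct_per_problem) -> float:
    """Mean over problems of the fraction of correct samples for each problem.

    correct_per_problem: one list of booleans per problem, k entries each.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct_per_problem]
    return sum(per_problem) / len(per_problem)


def pass_at_1(correct) -> float:
    """Accuracy with a single sampled response per problem."""
    return average_at_k([[c] for c in correct])


# Two problems, eight samples each (as in average@8)
samples = [
    [True, True, False, True, False, True, True, True],       # 6 of 8 correct
    [False, False, True, False, False, False, False, False],  # 1 of 8 correct
]
print(average_at_k(samples))                  # (0.75 + 0.125) / 2 = 0.4375
print(pass_at_1([True, False, True, True]))   # 0.75
```

Averaging over several samples reduces the variance of the estimate, which matters most on small benchmarks where a single problem shifts accuracy by a large amount.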

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results on MATH-500 demonstrate that spurious rewards yield large gains for Qwen2.5-Math-7B, approaching those from ground truth rewards, but fail for Llama3.1-8B-Instruct.
MATH-500	Accuracy (%)	49.4	78.5	+29.1
MATH-500	Accuracy (%)	49.4	70.8	+21.4
MATH-500	Accuracy (%)	49.4	73.5	+24.1
MATH-500	Accuracy (%)	36.8	30.4	-6.4
MATH-500	Accuracy (%)	36.8	44.0	+7.2
Prompting interventions show that forcing 'code reasoning' improves Qwen models (which have the latent skill) but hurts others.
MATH-500	Accuracy (%)	49.4	64.4	+15.0
MATH-500	Accuracy (%)	36.8	15.2	-21.6

Experiment Figures

Bar chart comparing MATH-500 accuracy gains across Qwen, OLMo, and Llama models under different reward conditions (Ground Truth, Majority Vote, Incorrect, Random).

Correlation between Code Reasoning Frequency and MATH-500 Accuracy during training.

Main Takeaways

RLVR performance gains on Qwen models are largely due to eliciting latent capabilities (specifically code reasoning) rather than learning new reasoning patterns from feedback.
The GRPO clipping mechanism inherently biases training towards high-probability tokens, which explains why random rewards can reinforce dominant behaviors like code generation.
Model choice is critical: behaviors observed in Qwen (robustness to spurious rewards) do not generalize to Llama or OLMo, warning against Qwen-centric RLVR research.
Code reasoning is a powerful predictor of success in Qwen models: solutions with code have ~60.9% accuracy vs ~35.0% for natural language.
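The code-reasoning statistics in the last takeaway (how often a model uses code, and accuracy with versus without it) can be computed from a set of graded responses. The sketch below is illustrative only: the detection rule (a case-insensitive search for the substring "python") is an assumption made for this example, and the function names are hypothetical.

```python
def uses_code(response: str) -> bool:
    """Assumed heuristic: a response counts as code reasoning if it mentions 'python'."""
    return "python" in response.lower()


def code_reasoning_stats(responses, correct):
    """Frequency of code reasoning and accuracy split by whether code was used."""
    with_code = [c for r, c in zip(responses, correct) if uses_code(r)]
    without_code = [c for r, c in zip(responses, correct) if not uses_code(r)]

    def mean(flags):
        # None signals an empty bucket rather than a misleading 0.0.
        return sum(flags) / len(flags) if flags else None

    return {
        "code_frequency": len(with_code) / len(responses),
        "accuracy_with_code": mean(with_code),
        "accuracy_without_code": mean(without_code),
    }


responses = [
    "Let's check with Python: print(2 + 2) gives 4, so the answer is 4.",
    "By symmetry the answer is 7.",
    "python: sum(range(5)) returns 10, so the answer is 10.",
    "Factoring gives the answer 3.",
]
correct = [True, False, True, True]
print(code_reasoning_stats(responses, correct))
```

Tracking `code_frequency` at each training checkpoint alongside benchmark accuracy is one way to reproduce the kind of correlation described in the Experiment Figures section.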

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Proximal Policy Optimization (PPO) and clipping
Language Model Post-training

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness (e.g., math answers) as the reward signal

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of outputs for the same prompt, often used without a value network

Spurious Rewards: Rewards designed to contain little or no task-relevant signal, such as random noise or rewards for incorrect answers

Clipping Bias: A phenomenon where the clipping term in PPO/GRPO objectives asymmetrically reinforces high-probability tokens while suppressing low-probability ones, even with zero-mean advantages

Code Reasoning: The behavior of writing Python code to solve a math problem and predicting its output in-context, without access to an external interpreter