The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR), LLM Reasoning, Inference Scaling

Decomposing RLVR reveals that solely penalizing incorrect reasoning paths (NSR) improves inference scaling and diversity more effectively than reinforcing correct ones, which tends to collapse the solution space.

Core Problem

Standard RLVR blends signals from correct and incorrect responses, but the specific mechanisms driving performance are unclear; reinforcing correct answers often leads to mode collapse, hurting performance at higher compute budgets (Pass@k).

Why it matters:

Models trained with standard RL (PPO/GRPO) often lose performance advantages at large sampling budgets (high k) due to reduced diversity
Understanding whether learning comes from 'knowing what is right' vs 'knowing what is wrong' is critical for designing better reasoning objectives
Solely reinforcing correct paths (Positive Sample Reinforcement) creates overconfident models that fail to explore valid alternative reasoning strategies

Concrete Example: When a model is trained only on correct samples (PSR), it might learn to memorize one specific solution path for a math problem. During testing, if allowed 256 attempts, it mostly repeats that single path. If that path is wrong for a slight variation, the model fails. In contrast, a model trained to avoid errors (NSR) suppresses known bad paths but keeps the rest of its distribution open, finding the correct answer through diverse valid attempts.

Key Novelty

Decomposition of RLVR into Positive (PSR) and Negative Sample Reinforcement (NSR)

Decomposes the RLVR objective into two separate components: maximizing likelihood of correct responses (PSR) and minimizing likelihood of incorrect responses (NSR)
Demonstrates that NSR alone—learning only from mistakes—is sufficient to match or beat full PPO/GRPO baselines on inference scaling metrics
Shows via gradient analysis that NSR works by 'pruning' incorrect paths while redistributing probability mass to the other candidates the model already considers plausible under its prior; unlike PSR, this preserves diversity
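The gradient argument above can be illustrated on a single softmax over four candidate tokens. This is a minimal sketch, not the paper's analysis: the logits, the learning rate, and the helper names `softmax` and `reinforce_step` are all assumptions chosen for illustration. It relies only on the standard identity d log pi(y) / d z_v = 1[v = y] - pi_v.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, sampled, reward, lr=1.0):
    """One gradient-descent step on the loss -reward * log pi(sampled).

    Uses d log pi(sampled) / d z_v = 1[v == sampled] - pi_v.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if v == sampled else 0.0) - p)
        for v, (z, p) in enumerate(zip(logits, probs))
    ]

# Toy prior over four candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.0, -1.0]
before = softmax(logits)

# NSR: token 0 was sampled and judged incorrect (reward = -1).
after_nsr = softmax(reinforce_step(logits, sampled=0, reward=-1))
# PSR: token 0 was sampled and judged correct (reward = +1).
after_psr = softmax(reinforce_step(logits, sampled=0, reward=+1))

nsr_gains = [a - b for a, b in zip(after_nsr, before)]
psr_gains = [a - b for a, b in zip(after_psr, before)]

print("before   :", [round(p, 3) for p in before])
print("after NSR:", [round(p, 3) for p in after_nsr])
print("after PSR:", [round(p, 3) for p in after_psr])
```

In this toy case the NSR step lowers the probability of the penalized token and raises every other token, with larger gains for tokens that were already more likely. The PSR step raises the sampled token and lowers all the others, which is the sharpening behavior the summary associates with reduced diversity.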

Architecture

Decomposition of RLVR into Positive Sample Reinforcement (PSR) and Negative Sample Reinforcement (NSR) learning paradigms.

Evaluation Highlights

NSR matches thinking-mode performance on MATH using a non-thinking base model: 94.0 Pass@1 vs 94.5 (Target) and 98.0 Pass@64 vs 97.8 (Target)
NSR consistently outperforms Positive Sample Reinforcement (PSR) on Pass@k for k > 8, avoiding the diversity collapse observed in PSR
Proposed Weighted-REINFORCE (upweighting NSR) consistently improves over strong baselines like PPO and GRPO on MATH, AIME 2025, and AMC23

Breakthrough Assessment

8/10

Provides a fundamental insight into *why* RLVR works (negative signal is more critical for scaling than positive signal) and offers a simpler, more effective training paradigm.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement learning on reasoning tasks where outcomes are binary and verifiable (correct/incorrect)

Inputs: Prompt x from dataset D

Outputs: Reasoning chain and final answer y

Pipeline Flow

Prompt Generation (Sample x)
Response Generation (Policy generates y)
Verification (Function r(x,y) checks correctness)
Loss Calculation (PSR, NSR, or Weighted-REINFORCE update)
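The first three pipeline steps can be sketched as a small rollout loop. This is a toy illustration under stated assumptions: `verify`, `collect_rollouts`, and `toy_generate` are hypothetical names, the verifier compares the whole response to a reference string (a real verifier would extract the final answer from the reasoning chain), and the "policy" is a random choice among canned answers. The +1/-1 reward convention follows the objective definitions in this summary.

```python
import random

def verify(response, reference):
    """Deterministic verifier: +1 if the answer matches the reference, else -1."""
    return 1 if response.strip() == reference.strip() else -1

def collect_rollouts(prompt, reference, generate, n_rollouts=8):
    """Sample n_rollouts responses for one prompt and split them by reward."""
    positives, negatives = [], []
    for _ in range(n_rollouts):
        response = generate(prompt)
        if verify(response, reference) == 1:
            positives.append(response)   # would feed the PSR term
        else:
            negatives.append(response)   # would feed the NSR term
    return positives, negatives

# Stand-in for the policy model: picks one of three canned answers.
rng = random.Random(0)

def toy_generate(prompt):
    return rng.choice(["4", "5", "22"])

positives, negatives = collect_rollouts("What is 2 + 2?", "4", toy_generate)
print(len(positives), "correct,", len(negatives), "incorrect")
```

Training with NSR alone would use only the `negatives` list for the update, while PSR alone would use only `positives`.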

System Modules

Policy Model

Generates reasoning traces and answers

Model or implementation: Qwen2.5-Math-7B, Qwen3-4B, or Llama-3.1-8B-Instruct

Verifier

Determines binary reward

Model or implementation: Deterministic Function

Novel Architectural Elements

Splitting the loss function into explicit Positive (PSR) and Negative (NSR) terms and upweighting the NSR term

Modeling

Base Model: Qwen2.5-Math-7B, Qwen3-4B, Llama-3.1-8B-Instruct

Training Method: REINFORCE (modified)

Objective Functions:

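Purpose: Reinforce correct samples (PSR).

Formally: L_PSR = -E[r(x,y) * log p(y|x)], taken over samples with r(x,y) = +1, so minimizing it raises the likelihood of correct responses

Purpose: Penalize incorrect samples (NSR).

Formally: L_NSR = -E[r(x,y) * log p(y|x)], taken over samples with r(x,y) = -1; this equals E[log p(y|x)] on those samples, so minimizing it lowers the likelihood of incorrect responses

Purpose: Weighted combination (Weighted-REINFORCE).

Formally: Upweighting the L_NSR contribution relative to L_PSR in the total gradient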
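The three objectives can be written out on a batch of sequence log-probabilities. This is a sketch under assumptions rather than the authors' implementation: both terms are normalized by the full batch size, and the relative upweighting of NSR is expressed by scaling the PSR term with a hypothetical coefficient `lam` (the default 0.1 is illustrative only; this summary does not state the value or the exact form of the weighting).

```python
def psr_loss(logps, rewards):
    """-(1/N) * sum of log p(y|x) over samples with reward +1."""
    n = len(logps)
    return -sum(lp for lp, r in zip(logps, rewards) if r == 1) / n

def nsr_loss(logps, rewards):
    """+(1/N) * sum of log p(y|x) over samples with reward -1.

    This is -r * log p with r = -1, so minimizing it pushes log p down.
    """
    n = len(logps)
    return sum(lp for lp, r in zip(logps, rewards) if r == -1) / n

def weighted_reinforce_loss(logps, rewards, lam=0.1):
    """lam * L_PSR + L_NSR; lam < 1 upweights NSR relative to PSR (assumed form)."""
    return lam * psr_loss(logps, rewards) + nsr_loss(logps, rewards)

# Toy batch: sequence log-probabilities and verifier rewards (illustrative values).
logps = [-1.0, -2.0, -0.5, -3.0]
rewards = [1, -1, 1, -1]

full = weighted_reinforce_loss(logps, rewards, lam=1.0)
reinforce = -sum(r * lp for lp, r in zip(logps, rewards)) / len(logps)
print(psr_loss(logps, rewards), nsr_loss(logps, rewards), full, reinforce)
```

With `lam=1.0` the weighted loss reduces to plain REINFORCE with +1/-1 rewards, which is the sense in which PSR and NSR decompose the full objective.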

Training Data:

MATH dataset (7,500 problems) for training

Key Hyperparameters:

prompt_batch_size: 1024
rollouts_per_prompt: 8
mini_batch_size: 256
learning_rate: 1e-6
training_temperature: 1.0
max_context_length: 4096 (Qwen2.5/Llama), 32768 (Qwen3)

Compute: Not reported in the paper

Comparison to Prior Work

vs. PPO/GRPO: NSR uses *only* negative samples (failed attempts) to update the model, whereas PPO/GRPO use both positive and negative. The paper shows NSR alone is sufficient and better for diversity.
vs. SFT: PSR is essentially on-policy SFT. The paper shows this hurts Pass@k scaling compared to NSR.
vs. ReST/STaR: These methods typically iterate on positive data (self-training). This paper argues focusing on negative data (NSR) is more effective for reasoning priors [not cited in paper, conceptual comparison].

Limitations

RL training hurts Llama-3.1-8B-Instruct inference scaling (Pass@256 drops), suggesting backbone dependency
Effectiveness of NSR relies on the model having strong prior beliefs (latent knowledge) toward which the probability mass removed from incorrect responses can be redistributed
Analysis focused primarily on math reasoning tasks; generalization to other domains (coding, writing) not explored

Reproducibility

Code: https://github.com/TianHongZXY/RLVR-Decomposed

The implementation uses the 'verl' framework, and the training hyperparameters are provided.

📊 Experiments & Results

Evaluation Setup

Math reasoning benchmarks evaluated across a spectrum of sampling budgets (Pass@k)

Benchmarks:

MATH (Mathematical Reasoning)
AIME 2025 (Mathematical Competition)
AMC23 (Mathematical Competition)

Metrics:

Pass@k (k up to 256)
Pass@1 (Greedy-like accuracy)
Statistical methodology: Unbiased estimator for Pass@k using n > k samples
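Assuming the unbiased estimator mentioned above is the standard combinatorial one, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, it can be sketched as follows. The function name `pass_at_k` and the example numbers are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct.

    Equals 1 minus the probability that k samples drawn without
    replacement from the n are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 samples for one problem, 3 of them correct (made-up counts).
for k in (1, 8, 64, 256):
    print(k, round(pass_at_k(256, 3, k), 4))
```

The benchmark-level score is the mean of this quantity over problems. For k = 1 it reduces to the fraction of correct samples, c / n.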

Key Results

Experiments using Qwen3-4B (non-thinking mode) to recover latent 'thinking' capabilities demonstrate NSR's ability to match strong baselines.
Benchmark	Metric	Baseline	This Paper	Δ
MATH	Pass@1	93.9	94.0	+0.1
MATH	Pass@64	98.2	98.0	-0.2
MATH	Pass@k (scaling trend)	Low	High	Positive

Experiment Figures

Pass@k inference scaling curves for Base, PSR, NSR, PPO, and GRPO across MATH, AIME, and AMC23.

Evolution of model entropy on a held-out test set during training.

Main Takeaways

NSR (Negative Sample Reinforcement) works by suppressing incorrect answers, which indirectly redistributes probability to correct ones while preserving diversity.
PSR (Positive Sample Reinforcement) collapses output distribution, improving greedy accuracy (Pass@1) but severely harming exploration capability (Pass@k for large k).
Weighted-REINFORCE (upweighting NSR) provides the best balance, consistently improving performance on difficult benchmarks like AIME 2025 and AMC23.
The choice of RL algorithm matters significantly for 'unlocking' latent capabilities (e.g., thinking mode in Qwen3); PSR fails to unlock this, while NSR succeeds.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, REINFORCE)
Language Model Fine-tuning
Inference Scaling (Pass@k metrics)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness (e.g., math answers) to guide model updates

PSR: Positive Sample Reinforcement—updating the model to increase the probability of generated responses that are correct

NSR: Negative Sample Reinforcement—updating the model to decrease the probability of generated responses that are incorrect

Pass@k: A metric measuring the probability that at least one correct answer is found within k generated samples

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance

Inference Scaling: Improving model performance by increasing the amount of computation (e.g., number of samples) used during test time

PPO: Proximal Policy Optimization—a standard RL algorithm that prevents the policy from changing too drastically in one step
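The within-group reward normalization mentioned in the GRPO entry can be sketched as below. This is a simplified illustration: the function name is hypothetical, the population standard deviation and the `eps` stabilizer are assumptions (implementations differ on both), and the clipping and KL terms of the full algorithm are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]

mixed = group_relative_advantages([1, 1, -1, -1])
all_correct = group_relative_advantages([1, 1, 1, 1])
print(mixed)
print(all_correct)
```

A group whose rollouts all receive the same reward yields zero advantages under this normalization, so such prompts contribute no gradient signal.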