RLVR: Reinforcement Learning with Verifiable Rewards—fine-tuning LLMs using binary rewards based on whether the final answer matches the ground truth
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance, often used without a separate critic model
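The group-relative normalization in GRPO can be sketched in a few lines. This is a minimal illustrative sketch (not the full algorithm, which also involves a clipped policy-gradient objective and a KL term): given the binary rewards for a group of G sampled outputs of the same prompt, the per-sample advantage is the reward minus the group mean, divided by the group standard deviation; `group_relative_advantages` and `eps` are names chosen here for illustration.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Compute group-normalized advantages for one prompt's sample group.

    rewards: list of scalar (e.g., binary 0/1) rewards for G outputs
    sampled from the same prompt. Returns (r - mean) / (std + eps)
    for each output -- no separate critic model is needed.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards [1, 0, 0, 1] yields advantages of roughly [+1, -1, -1, +1]: correct outputs are pushed up, incorrect ones down, and the advantages sum to zero within the group, which is what reduces gradient variance.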
Accuracy: The probability of generating a correct answer in a single attempt (pass@1)
Capability: The probability that a correct answer exists in the model's output distribution (approximated by pass@k with large k, e.g., k=256)
Pass@k: A metric measuring the probability that at least one correct answer is generated out of k independent samples
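Pass@k is usually estimated without bias from n ≥ k samples using the combinatorial estimator of Chen et al. (the Codex paper): if c of n samples are correct, pass@k = 1 - C(n-c, k) / C(n, k). A small sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    n: total samples drawn per problem
    c: number of correct samples among them
    k: budget of attempts being evaluated
    Returns the probability that at least one of k samples is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n (accuracy as defined above); with large k (e.g., k = 256) it approximates the capability definition above.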
Distillation: Supervised fine-tuning of a student model on outputs generated by a teacher model (or itself)
In-distribution: Questions where the model has a non-negligible probability (e.g., > 1%) of generating a correct answer
Self-Distillation: Fine-tuning a model on its own correct responses to valid problems
Rejection Sampling: A method of filtering generated data to keep only the samples that meet a given criterion (e.g., a correct final answer) for training
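As a sketch of how rejection sampling builds a fine-tuning set (the function and callback names `rejection_sample`, `generate`, and `is_correct` are illustrative, not from the source): sample several responses per prompt and keep only the verified ones.

```python
def rejection_sample(prompts, generate, is_correct, n_samples=8):
    """Collect (prompt, response) pairs that pass a verifier.

    generate(prompt) -> response samples one model output;
    is_correct(prompt, response) -> bool checks the final answer.
    Only verified pairs are kept for supervised fine-tuning.
    """
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = generate(prompt)
            if is_correct(prompt, response):
                kept.append((prompt, response))
    return kept
```

Self-distillation, as defined above, is the special case where `generate` is the model being trained and the kept pairs are fed back into its own supervised fine-tuning.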