
Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov
arXiv (2025)
Tags: RL, Reasoning, Factuality, Benchmark

📝 Paper Summary

Post-training of Reasoning Models: Reinforcement Learning vs. Supervised Fine-Tuning
GRPO acts as a scalpel that amplifies existing reasoning skills by subtly adjusting attention weights, whereas SFT acts as a hammer that aggressively updates mid-layer MLPs, improving specific tasks but degrading general knowledge.
Core Problem
The training dynamics of Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) for reasoning are poorly understood, particularly regarding why SFT often degrades general capabilities while RL preserves them.
Why it matters:
  • Reasoning models trained with SFT often suffer a 'tax' on general capabilities (e.g., on knowledge benchmarks like MMLU)
  • Frontier models trained with RL have been shown to hallucinate more, creating a need to understand internal model changes
  • Current approaches lack controlled comparisons to determine whether reasoning skills are newly acquired or merely amplifications of existing capabilities
Concrete Example: When trained on math problems, an SFT model might memorize specific solution patterns (the 'hammer' approach), causing it to forget general facts on benchmarks like MMLU. In contrast, a GRPO model reinforces the correct solution path it already knows (the 'scalpel'), preserving its original knowledge base.
Key Novelty
Scalpel (GRPO) vs. Hammer (SFT) Hypothesis
  • Demonstrates that GRPO (Group Relative Policy Optimization) makes sparse, subtle updates primarily to attention query/key weights, acting as a 'scalpel' to amplify existing capabilities
  • Shows that SFT (Supervised Fine-Tuning) causes large-scale parameter shifts, particularly in mid-layer MLPs associated with factual memory, acting as a 'hammer' that overwrites existing skills
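The "group relative" mechanism behind GRPO can be illustrated with a short sketch (not the paper's code): rewards for a batch of sampled completions are standardized against the mean and standard deviation of their own group (all completions for one prompt), so the update signal only reweights behaviors the model already produces. All names and reward values here are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: standardize each completion's reward against
    the mean/std of its own group (rows = prompts, cols = completions)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (rewards - mean) / std

# Two prompts, four sampled completions each (toy 0/1 correctness rewards).
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
adv = group_relative_advantages(rewards)
print(adv)  # each row is centered at zero: correct samples get positive advantage
```

Because the advantages are zero-mean within each group, prompts the model already solves (or already fails) uniformly contribute no gradient signal, which is consistent with the paper's framing of GRPO as amplifying existing capabilities rather than installing new ones.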
Evaluation Highlights
  • SFT leads to significant degradation on knowledge-intensive benchmarks (MMLU, MMLU-Pro) compared to GRPO, which maintains base model performance
  • SFT causes a rapid, early spike in KL divergence from the base model, indicating a drastic shift in output distribution, while GRPO divergence grows gradually
  • Parameter analysis reveals SFT heavily modifies mid-layer MLPs (linked to factual associations), whereas GRPO primarily affects attention query/key matrices
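The parameter analysis in the last bullet can be approximated with a simple diagnostic (a hypothetical sketch, not the authors' code): compare base and fine-tuned checkpoints and aggregate relative weight shifts by module type. The parameter names below are illustrative stand-ins for common transformer layouts.

```python
import numpy as np

def relative_shift(base: np.ndarray, tuned: np.ndarray) -> float:
    """Relative Frobenius norm of the weight delta, ||W' - W|| / ||W||."""
    return float(np.linalg.norm(tuned - base) / np.linalg.norm(base))

def shifts_by_module(base_sd: dict, tuned_sd: dict) -> dict:
    """Average relative shift per coarse module label parsed from the name."""
    buckets: dict = {}
    for name, w in base_sd.items():
        label = ("attn_qk" if (".q_proj" in name or ".k_proj" in name)
                 else "mlp" if ".mlp." in name
                 else "other")
        buckets.setdefault(label, []).append(relative_shift(w, tuned_sd[name]))
    return {k: float(np.mean(v)) for k, v in buckets.items()}

# Toy state dicts standing in for base and fine-tuned checkpoints.
rng = np.random.default_rng(0)
base = {
    "layers.10.self_attn.q_proj.weight": rng.normal(size=(8, 8)),
    "layers.10.self_attn.k_proj.weight": rng.normal(size=(8, 8)),
    "layers.10.mlp.up_proj.weight": rng.normal(size=(8, 8)),
}
# Simulate an SFT-like run: large update to the MLP, tiny update to attention.
tuned = {
    name: w + (0.5 if ".mlp." in name else 0.01) * rng.normal(size=w.shape)
    for name, w in base.items()
}
print(shifts_by_module(base, tuned))  # the mlp shift dwarfs the attn_qk shift
```

On real checkpoints the same per-name grouping applies to the model's state dict; under the paper's findings, an SFT run would show the largest shifts concentrated in mid-layer MLP entries, a GRPO run in attention query/key entries.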
Breakthrough Assessment
7/10
Provides a crucial mechanistic explanation for the SFT-vs-RL trade-off in reasoning models. While the proposed mitigation (freezing layers) had mixed results, the diagnostic insights into 'where' the model changes are significant for future post-training strategies.
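The layer-freezing mitigation mentioned above can be sketched on a toy model (hypothetical module names; the paper's actual setup may differ): mark mid-layer MLP parameters as non-trainable before SFT so the optimizer cannot overwrite the weights associated with factual memory.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Minimal stand-in for a transformer block: attention proxy + MLP."""
    def __init__(self, d: int = 16):
        super().__init__()
        self.attn = nn.Linear(d, d)  # proxy for the attention projections
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.mlp(x + self.attn(x))

def freeze_mid_layer_mlps(blocks: nn.ModuleList, lo: int, hi: int) -> int:
    """Freeze MLP parameters in blocks lo..hi (inclusive); return count frozen."""
    frozen = 0
    for i, block in enumerate(blocks):
        if lo <= i <= hi:
            for p in block.mlp.parameters():
                p.requires_grad = False  # optimizer will skip these weights
                frozen += p.numel()
    return frozen

model = nn.ModuleList([ToyBlock() for _ in range(8)])
n = freeze_mid_layer_mlps(model, lo=2, hi=5)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"froze {n} MLP params; {trainable} remain trainable")
```

The optimizer would then be built only over the remaining trainable parameters, e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`; the paper reports this kind of mitigation gave mixed results.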