
Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov
arXiv (2025)
Tags: RL, Reasoning, Factuality, Benchmark

📝 Paper Summary

Post-training of Reasoning Models: Reinforcement Learning vs. Supervised Fine-Tuning
GRPO acts as a scalpel that amplifies existing reasoning skills by subtly adjusting attention weights, whereas SFT acts as a hammer that aggressively updates mid-layer MLPs, improving specific tasks but degrading general knowledge.
Core Problem
The training dynamics of Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) for reasoning are poorly understood, particularly regarding why SFT often degrades general capabilities while RL preserves them.
Why it matters:
  • Reasoning models trained with SFT often suffer a 'tax' on general capabilities (e.g., on knowledge benchmarks like MMLU)
  • Frontier models trained with RL have been shown to hallucinate more, creating a need to understand internal model changes
  • Current approaches lack controlled comparisons to determine whether reasoning skills are newly acquired or merely amplifications of existing capabilities
Concrete Example: When trained on math problems, an SFT model might memorize specific solution patterns (the 'hammer' approach), causing it to forget general facts on benchmarks like MMLU. In contrast, a GRPO model reinforces the correct solution path it already knows (the 'scalpel'), preserving its original knowledge base.
Key Novelty
Scalpel (GRPO) vs. Hammer (SFT) Hypothesis
  • Demonstrates that GRPO (Group Relative Policy Optimization) makes sparse, subtle updates primarily to attention query/key weights, acting as a 'scalpel' to amplify existing capabilities
  • Shows that SFT (Supervised Fine-Tuning) causes large-scale parameter shifts, particularly in mid-layer MLPs associated with factual memory, acting as a 'hammer' that overwrites existing skills
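The "group relative" mechanism behind GRPO can be illustrated with a short sketch (not the paper's code): rewards for a batch of sampled completions are standardized against the mean and standard deviation of their own group (all completions for one prompt), so the update signal only reweights behaviors the model already produces. All names and reward values here are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: standardize each completion's reward against
    the mean/std of its own group (rows = prompts, cols = completions)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (rewards - mean) / std

# Two prompts, four sampled completions each (toy 0/1 correctness rewards).
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
adv = group_relative_advantages(rewards)
print(adv)  # each row is centered at zero: correct samples get positive advantage
```

Because the advantages are zero-mean within each group, prompts the model already solves (or already fails) uniformly contribute no gradient signal, which is consistent with the paper's framing of GRPO as amplifying existing capabilities rather than installing new ones.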
Evaluation Highlights
  • SFT leads to significant degradation on knowledge-intensive benchmarks (MMLU, MMLU-Pro) compared to GRPO, which maintains base model performance
  • SFT causes a rapid, early spike in KL divergence from the base model, indicating a drastic shift in output distribution, while GRPO divergence grows gradually
  • Parameter analysis reveals SFT heavily modifies mid-layer MLPs (linked to factual associations), whereas GRPO primarily affects attention query/key matrices
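The parameter analysis in the last bullet can be approximated with a simple diagnostic (a hypothetical sketch, not the authors' code): compare base and fine-tuned checkpoints and aggregate relative weight shifts by module type. The parameter names below are illustrative stand-ins for common transformer layouts.

```python
import numpy as np

def relative_shift(base: np.ndarray, tuned: np.ndarray) -> float:
    """Relative Frobenius norm of the weight delta, ||W' - W|| / ||W||."""
    return float(np.linalg.norm(tuned - base) / np.linalg.norm(base))

def shifts_by_module(base_sd: dict, tuned_sd: dict) -> dict:
    """Average relative shift per coarse module label parsed from the name."""
    buckets: dict = {}
    for name, w in base_sd.items():
        label = ("attn_qk" if (".q_proj" in name or ".k_proj" in name)
                 else "mlp" if ".mlp." in name
                 else "other")
        buckets.setdefault(label, []).append(relative_shift(w, tuned_sd[name]))
    return {k: float(np.mean(v)) for k, v in buckets.items()}

# Toy state dicts standing in for base and fine-tuned checkpoints.
rng = np.random.default_rng(0)
base = {
    "layers.10.self_attn.q_proj.weight": rng.normal(size=(8, 8)),
    "layers.10.self_attn.k_proj.weight": rng.normal(size=(8, 8)),
    "layers.10.mlp.up_proj.weight": rng.normal(size=(8, 8)),
}
# Simulate an SFT-like run: large update to the MLP, tiny update to attention.
tuned = {
    name: w + (0.5 if ".mlp." in name else 0.01) * rng.normal(size=w.shape)
    for name, w in base.items()
}
print(shifts_by_module(base, tuned))  # the mlp shift dwarfs the attn_qk shift
```

On real checkpoints the same per-name grouping applies to the model's state dict; under the paper's findings, an SFT run would show the largest shifts concentrated in mid-layer MLP entries, a GRPO run in attention query/key entries.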
Breakthrough Assessment
7/10
Provides a crucial mechanistic explanation for the SFT-vs-RL trade-off in reasoning models. While the proposed mitigation (freezing layers) had mixed results, the diagnostic insights into 'where' the model changes are significant for future post-training strategies.
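The layer-freezing mitigation mentioned above can be sketched on a toy model (hypothetical module names; the paper's actual setup may differ): mark mid-layer MLP parameters as non-trainable before SFT so the optimizer cannot overwrite the weights associated with factual memory.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Minimal stand-in for a transformer block: attention proxy + MLP."""
    def __init__(self, d: int = 16):
        super().__init__()
        self.attn = nn.Linear(d, d)  # proxy for the attention projections
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.mlp(x + self.attn(x))

def freeze_mid_layer_mlps(blocks: nn.ModuleList, lo: int, hi: int) -> int:
    """Freeze MLP parameters in blocks lo..hi (inclusive); return count frozen."""
    frozen = 0
    for i, block in enumerate(blocks):
        if lo <= i <= hi:
            for p in block.mlp.parameters():
                p.requires_grad = False  # optimizer will skip these weights
                frozen += p.numel()
    return frozen

model = nn.ModuleList([ToyBlock() for _ in range(8)])
n = freeze_mid_layer_mlps(model, lo=2, hi=5)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"froze {n} MLP params; {trainable} remain trainable")
```

The optimizer would then be built only over the remaining trainable parameters, e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`; the paper reports this kind of mitigation gave mixed results.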