RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs

Hangzhan Jin, Sicheng Lv, Sifan Wu, Mohammad Hamdaqa
Polytechnique Montréal
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Post-training dynamics · Mechanistic interpretability
Reinforcement learning fine-tuning primarily acts as a restoration mechanism that reverses the directional drift of singular vectors caused by aggressive supervised fine-tuning, rather than creating new generalization capabilities.
Core Problem
Supervised Fine-Tuning (SFT) improves in-distribution performance but causes catastrophic forgetting of out-of-distribution (OOD) reasoning abilities as training progresses.
Why it matters:
  • Current two-stage training (SFT then RL) is empirically popular but lacks a mechanistic explanation for why RL recovers performance lost during SFT
  • Practitioners need actionable guidance on how long to run SFT to avoid irreversible damage to model capabilities before switching to RL
  • Understanding the spectral dynamics of weight matrices can lead to cheaper restoration methods than full RL fine-tuning
Concrete Example: In a 24-point card game, SFT trains a model to solve standard puzzles (in-distribution, ID), but as it overfits, it loses the ability to solve a variant where face cards (J, Q, K) represent different values (OOD). The paper shows RL can recover this lost ability unless SFT has pushed the model into a regime of severe overfitting.
Key Novelty
RL as Spectral Restoration
  • Demonstrates that RL's primary role in post-training is restoring OOD capabilities lost during SFT by reversing specific directional shifts in weight matrices
  • Uses Singular Value Decomposition (SVD) to show that performance changes are driven by the rotation of singular vectors (directions), not changes in singular values (magnitudes)
  • Proposes that low-rank restoration of just the top singular vectors can recover significant OOD performance without full RL training
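The SVD-based analysis described above can be sketched on toy matrices. The snippet below is illustrative only: the paper's actual choice of layers, metrics, and matrix shapes is not specified here, and the toy "SFT drift" is simulated with random noise. It separates the two quantities the authors compare: rotation of singular vectors (directions) versus change in singular values (magnitudes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a base-model weight matrix and its SFT counterpart
# (hypothetical shapes; real LLM weight matrices are much larger).
W_base = rng.standard_normal((64, 32))
W_sft = W_base + 0.3 * rng.standard_normal((64, 32))  # simulated SFT drift

U0, S0, Vt0 = np.linalg.svd(W_base, full_matrices=False)
U1, S1, Vt1 = np.linalg.svd(W_sft, full_matrices=False)

# Directional drift: |cosine| between matched left singular vectors.
# Absolute value is taken because singular vectors are sign-ambiguous;
# values near 1 mean the direction is preserved, near 0 means rotation.
rotation = np.abs(np.sum(U0 * U1, axis=0))

# Magnitude drift: relative change in the singular values themselves.
magnitude = np.abs(S1 - S0) / S0
```

Under the paper's claim, `rotation` (not `magnitude`) is what tracks the loss of OOD performance during SFT.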
Evaluation Highlights
  • RL restores up to 99% of OOD performance lost during SFT for Qwen-2.5-7B (17.09% → 19.66%) and 85% for Llama-3.2-11B (8.97% → 15.38%)
  • Restoring singular vector directions for just the top 20% of singular values recovers 70-80% of the model's OOD performance without full training
  • Identifies a 'point of no return': if SFT overfits severely (pushing the model into a distinct representation regime), RL fails to recover OOD abilities
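The "restore only the top singular directions" idea can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: it swaps the top-fraction singular vectors of an SFT weight matrix for those of a reference (e.g. base or RL-tuned) matrix while keeping the SFT singular values, mirroring the finding that directions, not magnitudes, drive the recovery. The function name and the 20% default are assumptions for the sketch.

```python
import numpy as np

def spectral_restore(W_ref, W_sft, frac=0.2):
    """Return a copy of W_sft whose top-`frac` singular directions are
    replaced by those of W_ref, keeping W_sft's singular values."""
    U_r, _, Vt_r = np.linalg.svd(W_ref, full_matrices=False)
    U_s, S_s, Vt_s = np.linalg.svd(W_sft, full_matrices=False)
    k = max(1, int(frac * len(S_s)))  # e.g. top 20% of directions
    # Splice reference directions into the top-k slots, keep the rest.
    U = np.concatenate([U_r[:, :k], U_s[:, k:]], axis=1)
    Vt = np.concatenate([Vt_r[:k], Vt_s[k:]], axis=0)
    # Recompose: note U/Vt are no longer exactly orthogonal after the
    # splice, so this is an approximation, as in any low-rank patch.
    return (U * S_s) @ Vt
```

Applied per weight matrix, this kind of patch is what the evaluation suggests can recover 70-80% of OOD performance without any further training.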
Breakthrough Assessment
8/10
Provides a strong mechanistic explanation for a widely observed phenomenon (RL fixing SFT forgetting). The finding that singular vector rotation matters more than magnitude challenges existing spectral analysis assumptions.