RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs

Hangzhan Jin, Sicheng Lv, Sifan Wu, Mohammad Hamdaqa
Polytechnique Montréal
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Post-training dynamics · Mechanistic interpretability
Reinforcement learning fine-tuning primarily acts as a restoration mechanism that reverses the directional drift of singular vectors caused by aggressive supervised fine-tuning, rather than creating new generalization capabilities.
Core Problem
Supervised Fine-Tuning (SFT) improves in-distribution performance but causes catastrophic forgetting of out-of-distribution (OOD) reasoning abilities as training progresses.
Why it matters:
  • Current two-stage training (SFT then RL) is empirically popular but lacks a mechanistic explanation for why RL recovers performance lost during SFT
  • Practitioners need actionable guidance on how long to run SFT to avoid irreversible damage to model capabilities before switching to RL
  • Understanding the spectral dynamics of weight matrices can lead to cheaper restoration methods than full RL fine-tuning
Concrete Example: In a 24-point card game, SFT trains a model to solve standard puzzles (in-distribution, ID), but as it overfits, it loses the ability to solve a variant where face cards (J, Q, K) represent different values (OOD). The paper shows RL can recover this lost ability unless SFT has pushed the model into a regime of severe overfitting.
Key Novelty
RL as Spectral Restoration
  • Demonstrates that RL's primary role in post-training is restoring OOD capabilities lost during SFT by reversing specific directional shifts in weight matrices
  • Uses Singular Value Decomposition (SVD) to show that performance changes are driven by the rotation of singular vectors (directions), not changes in singular values (magnitudes)
  • Proposes that low-rank restoration of just the top singular vectors can recover significant OOD performance without full RL training
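The SVD-based analysis described above can be sketched on toy matrices. The snippet below is illustrative only: the paper's actual choice of layers, metrics, and matrix shapes is not specified here, and the toy "SFT drift" is simulated with random noise. It separates the two quantities the authors compare: rotation of singular vectors (directions) versus change in singular values (magnitudes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a base-model weight matrix and its SFT counterpart
# (hypothetical shapes; real LLM weight matrices are much larger).
W_base = rng.standard_normal((64, 32))
W_sft = W_base + 0.3 * rng.standard_normal((64, 32))  # simulated SFT drift

U0, S0, Vt0 = np.linalg.svd(W_base, full_matrices=False)
U1, S1, Vt1 = np.linalg.svd(W_sft, full_matrices=False)

# Directional drift: |cosine| between matched left singular vectors.
# Absolute value is taken because singular vectors are sign-ambiguous;
# values near 1 mean the direction is preserved, near 0 means rotation.
rotation = np.abs(np.sum(U0 * U1, axis=0))

# Magnitude drift: relative change in the singular values themselves.
magnitude = np.abs(S1 - S0) / S0
```

Under the paper's claim, `rotation` (not `magnitude`) is what tracks the loss of OOD performance during SFT.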
Evaluation Highlights
  • RL restores up to 99% of OOD performance lost during SFT for Qwen-2.5-7B (17.09% → 19.66%) and 85% for Llama-3.2-11B (8.97% → 15.38%)
  • Restoring singular vector directions for just the top 20% of singular values recovers 70-80% of the model's OOD performance without full training
  • Identifies a 'point of no return': if SFT overfits severely (pushing the model into a distinct representation regime), RL fails to recover OOD abilities
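The "restore only the top singular directions" idea can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: it swaps the top-fraction singular vectors of an SFT weight matrix for those of a reference (e.g. base or RL-tuned) matrix while keeping the SFT singular values, mirroring the finding that directions, not magnitudes, drive the recovery. The function name and the 20% default are assumptions for the sketch.

```python
import numpy as np

def spectral_restore(W_ref, W_sft, frac=0.2):
    """Return a copy of W_sft whose top-`frac` singular directions are
    replaced by those of W_ref, keeping W_sft's singular values."""
    U_r, _, Vt_r = np.linalg.svd(W_ref, full_matrices=False)
    U_s, S_s, Vt_s = np.linalg.svd(W_sft, full_matrices=False)
    k = max(1, int(frac * len(S_s)))  # e.g. top 20% of directions
    # Splice reference directions into the top-k slots, keep the rest.
    U = np.concatenate([U_r[:, :k], U_s[:, k:]], axis=1)
    Vt = np.concatenate([Vt_r[:k], Vt_s[k:]], axis=0)
    # Recompose: note U/Vt are no longer exactly orthogonal after the
    # splice, so this is an approximation, as in any low-rank patch.
    return (U * S_s) @ Vt
```

Applied per weight matrix, this kind of patch is what the evaluation suggests can recover 70-80% of OOD performance without any further training.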
Breakthrough Assessment
8/10
Provides a strong mechanistic explanation for a widely observed phenomenon (RL fixing SFT forgetting). The finding that singular vector rotation matters more than magnitude challenges existing spectral analysis assumptions.