SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

📝 Paper Summary

LLM Reasoning Post-training / Alignment

SRFT unifies supervised learning and reinforcement learning into a single training stage by using entropy to dynamically balance learning from demonstrations versus self-exploration.

Core Problem

Sequential fine-tuning (Supervised Fine-Tuning followed by Reinforcement Learning) often fails because SFT can induce overfitting that limits RL's exploration, while pure RL is sample-inefficient and prone to mode collapse.

Why it matters:

Separating SFT and RL creates a trade-off: SFT provides knowledge but hurts exploration, while RL optimizes rewards but struggles with sparse signals.
Sequential SFT→RL often leads to suboptimal policies where the model forgets SFT knowledge or gets stuck in local optima near the base policy.
Practitioners struggle to balance using expert demonstrations vs. letting the model explore new solutions.

Concrete Example: In a math problem, an SFT-only model might memorize a specific solution format but fail on a variation. A sequential SFT→RL model might 'forget' the correct reasoning steps during RL exploration. SRFT trains on both simultaneously, keeping the demonstration guidance while rewarding novel correct solutions.

Key Novelty

Supervised Reinforcement Fine-Tuning (SRFT)

Simultaneously applies SFT loss (on expert demonstrations) and RL loss (on model-generated rollouts) in a single training loop.
Uses policy entropy (a measure of output uncertainty) as a dynamic indicator to weight the two losses, so that the model neither becomes too deterministic (overfitting to demonstrations) nor fails to converge.
Treats positive and negative RL samples differently to stabilize exploration while anchoring the model to high-quality demonstrations.
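The entropy-based balancing described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: this summary does not state the exact weighting function, so the exponential mappings, the helper names `token_entropy`, `mean_entropy`, and `entropy_weights`, and the coefficients `sft_scale` and `rl_scale` are all assumptions chosen for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(token_dists):
    """Average entropy over a list of next-token distributions."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def entropy_weights(token_dists, sft_scale=0.5, rl_scale=0.1):
    """Map mean policy entropy to a pair (w_sft, w_rl).

    ASSUMPTION: the exponential forms and both scale values are
    hypothetical. In this sketch the SFT weight shrinks as entropy
    grows and the RL weight grows with it.
    """
    h = mean_entropy(token_dists)
    w_sft = sft_scale * math.exp(-h)
    w_rl = rl_scale * math.exp(h)
    return w_sft, w_rl

# Toy distributions over a 4-token vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]]
uncertain = [[0.25, 0.25, 0.25, 0.25]]
print(entropy_weights(confident))
print(entropy_weights(uncertain))
```

In an actual training loop such weights would typically be treated as constants with respect to the gradient, so that they rescale the losses without becoming an optimization target themselves.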

Architecture

Conceptual diagram of SRFT method compared to sequential SFT->RL.

Evaluation Highlights

Achieves 59.1% average accuracy on five math benchmarks (including MATH500 and AIME24), outperforming zero-RL baselines by 9.0 percentage points.
Demonstrates strong generalization with a 10.9-point improvement on out-of-distribution benchmarks compared to zero-RL methods.
Outperforms sequential SFT→RL approaches, avoiding the performance degradation often seen when initializing RL from an overfitted SFT model.

Breakthrough Assessment

7/10

Strong empirical results on math reasoning and a principled analysis of entropy dynamics. The single-stage integration is effective, though conceptually an evolution of existing hybrid losses rather than a paradigm shift.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning Large Language Models (LLMs) for reasoning tasks using both demonstration datasets and reinforcement learning signals.

Inputs: Prompt x and demonstration target y (from dataset), plus model-generated trajectories.

Outputs: Optimized policy π_θ that maximizes reasoning accuracy.

Pipeline Flow

Input Prompt Processing
Reasoning/Generation
Output Selection (Greedy/Sampling)

System Modules

LLM Policy

Generates reasoning steps and final answer for a given math problem

Model or implementation: Qwen-2.5-Math-7B

Novel Architectural Elements

Unified Loss Integration: Combines SFT loss on demonstrations and GRPO loss on rollouts within the same optimization step, weighted by entropy-aware mechanisms.

Modeling

Base Model: Qwen-2.5-Math-7B

Training Method: SRFT (Supervised Reinforcement Fine-Tuning), utilizing GRPO for the RL component

Objective Functions:

Purpose: Combine supervised learning from gold data with reinforcement learning from exploration.
Formally: L_SRFT = L_SFT + L_RL.

Purpose: Standard supervised learning on demonstration data.
Formally: L_SFT = −Σ_j log π_θ(y_j | y_<j, x).

Purpose: Optimize the policy via Group Relative Policy Optimization on generated rollouts.
Formally: L_RL uses a clipped surrogate objective based on advantage estimates from group rewards.
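The three objectives above can be sketched in plain Python on toy numbers. This is a simplified illustration under stated assumptions, not the authors' code: it uses one importance ratio per rollout rather than per token, omits the entropy-aware weighting, and the function names and the `clip_eps=0.2` default are hypothetical (the summary only says the clip range follows the standard GRPO setting).

```python
import math

def sft_loss(demo_token_logprobs):
    """L_SFT: negative sum of log pi_theta(y_j | y_<j, x) over demo tokens."""
    return -sum(demo_token_logprobs)

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate_loss(ratios, advantages, clip_eps=0.2):
    """Negated clipped surrogate, averaged over the rollouts in a group.

    ratios[i] stands for pi_theta(o_i | x) / pi_old(o_i | x).
    ASSUMPTION: one ratio per rollout; clip_eps=0.2 is illustrative.
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

def srft_loss(demo_token_logprobs, ratios, rewards):
    """L_SRFT = L_SFT + L_RL, unweighted as in the formula above."""
    l_sft = sft_loss(demo_token_logprobs)
    l_rl = clipped_surrogate_loss(ratios, group_advantages(rewards))
    return l_sft + l_rl

# Toy step: a 2-token demonstration and a group of 4 rollouts with 0/1 rewards.
demo = [math.log(0.5), math.log(0.25)]
print(srft_loss(demo, ratios=[1.1, 0.9, 1.3, 1.0], rewards=[1.0, 0.0, 1.0, 0.0]))
```

Because advantages are computed relative to the group mean, no separate value network is needed, which is the property the GRPO definition in this summary refers to.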

Adaptation: Full fine-tuning

Training Data:

Demonstration data: High-quality math problems with solutions
Self-exploration data: Generated on-the-fly during training via rollouts

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
clip_range_epsilon: Standard GRPO setting (implied)
SFT_steps_in_baseline: 350 steps (for sequential comparison)
RL_steps_in_baseline: 150 steps (for sequential comparison)

Compute: Not reported in the paper

Comparison to Prior Work

vs. SFT->RL: SRFT prevents the 'catastrophic forgetting' or overfitting often seen after the initial SFT stage.
vs. Pure RL: SRFT leverages demonstration data to improve sample efficiency and guide exploration.
vs. DeepSeek-R1: SRFT is a training method applied to a smaller 7B base model to improve its reasoning, not a specific model or architecture; DeepSeek-R1 is cited as a reference point rather than as a direct baseline.

Limitations

Hyperparameters (learning rate, etc.) are not explicitly listed in the main text.
The analysis is primarily focused on mathematical reasoning; applicability to other domains (coding, creative writing) is less explored.
Computational cost comparison between single-stage SRFT and two-stage SFT->RL is not detailed beyond 'training efficiency' claims.

Reproducibility

Code: https://anonymous.4open.science/w/SRFT2025

Model weights are available on HuggingFace. Specific hyperparameters such as learning rate and batch size are not detailed in the text, though the baselines are tuned.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using Chain-of-Thought (CoT) generation.

Benchmarks:

AIME24 (Competition Math)
AMC (Competition Math)
MATH500 (Math Problem Solving)
Minerva (Scientific Reasoning)
OlympiadBench (Competition Math)
Gaokao (OOD Math, Chinese college entrance exam)
CMATH (OOD Math)
CollegeMath (OOD Math)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
SRFT outperforms standard baselines on in-distribution mathematical reasoning benchmarks.
Average (5 Math Benchmarks)	Accuracy	50.1	59.1	+9.0
SRFT demonstrates superior generalization on out-of-distribution (OOD) benchmarks.
Average (3 OOD Benchmarks)	Accuracy	Not reported in the paper	Not reported in the paper	+10.9
Average (All Benchmarks)	Accuracy	Not reported in the paper	Not reported in the paper	+4.7
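The Δ column reports absolute differences in accuracy (percentage points), not relative improvements. A small sketch using only the in-distribution averages from the table makes the distinction concrete; the function names are hypothetical. For these numbers the absolute gain is 9.0 points and the relative gain is roughly 18%.

```python
def absolute_gain(baseline, new):
    """Difference in accuracy, in percentage points."""
    return new - baseline

def relative_gain_pct(baseline, new):
    """Improvement expressed as a percentage of the baseline accuracy."""
    return 100.0 * (new - baseline) / baseline

# In-distribution averages taken from the table above.
baseline_avg, srft_avg = 50.1, 59.1
print(round(absolute_gain(baseline_avg, srft_avg), 1))
print(round(relative_gain_pct(baseline_avg, srft_avg), 1))
```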

Experiment Figures

Visualization of token probability changes for SFT vs. RL.

Entropy dynamics during different fine-tuning sequences (SFT->RL vs. RL->SFT).

Main Takeaways

SRFT significantly outperforms both SFT-only and sequential SFT->RL methods across mathematical benchmarks.
Analysis reveals SFT causes coarse-grained global distribution changes, while RL performs fine-grained local optimization.
Sequential RL->SFT (RL first, then SFT) performs poorly because SFT destroys the fine-tuned policy structure established by RL.
Single-stage integration prevents the transient performance degradation observed in sequential SFT->RL (attributed to forgetting or misalignment).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (Policy Gradient, PPO/GRPO)
Supervised Fine-Tuning (SFT) objectives
Entropy and KL Divergence

Key Terms

SFT: Supervised Fine-Tuning—training the model to maximize the likelihood of ground-truth demonstrations.

RL: Reinforcement Learning—training the model to maximize a reward signal (e.g., correct answer) through exploration.

GRPO: Group Relative Policy Optimization—a memory-efficient RL algorithm that normalizes rewards within a group of outputs for the same prompt, removing the need for a value function.

Entropy: A measure of the randomness or uncertainty in the model's output distribution. High entropy means diverse outputs; low entropy means deterministic outputs.

OOD: Out-of-Distribution—test data that differs significantly from the training distribution.

Mode Collapse: A failure mode in generative models where the model produces limited varieties of samples (e.g., repeating the same answer).

KL Divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from a reference distribution.
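The entropy and KL divergence terms defined above can be computed directly for small discrete distributions. The following is a minimal sketch with hypothetical helper names, using natural logarithms (nats) and assuming the reference distribution is nonzero wherever the first distribution is.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; zero-probability outcomes contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q):
    """KL(p || q) in nats. Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # diverse outputs: high entropy
peaked = [0.97, 0.01, 0.01, 0.01]    # near-deterministic: low entropy

print(entropy(uniform), entropy(peaked))
# KL divergence is not symmetric: swapping the arguments changes the value.
print(kl_divergence(uniform, peaked), kl_divergence(peaked, uniform))
```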