SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning

📝 Paper Summary

LLM Post-training, Supervised Fine-Tuning (SFT), Reinforcement Learning (RL)

SED-SFT prevents mode collapse by applying an entropy regularization penalty only to tokens with high exploration potential, thereby improving diversity for subsequent reinforcement learning without degrading accuracy on rigid tokens.

Core Problem

Standard Cross-Entropy (CE) loss in SFT drives models to converge on single ground-truth paths (mode collapse), severely restricting the exploration space required for effective downstream Reinforcement Learning.

Why it matters:

Lack of diversity during SFT limits the model's ability to explore alternative reasoning paths during RL, leading to suboptimal final performance
Existing solutions like GEM blindly encourage diversity on all tokens, damaging accuracy on rigid structural tokens (e.g., 'The answer is') where no diversity is needed

Concrete Example: In a math proof, the phrase 'The answer is' requires rigid adherence to syntax (low exploration need), while the subsequent reasoning steps allow for varied approaches. Naive diversity methods force the model to lower its confidence on 'is', hurting coherence. SED-SFT detects the low exploration space of 'is' and masks the penalty, applying it only to the reasoning tokens.

Key Novelty

Selectively Encouraging Diversity (SED)

Identifies 'flexible' vs. 'rigid' tokens using the cumulative probability of the top-k predicted tokens as a proxy for exploration space
Applies a masking mechanism to the entropy regularization term, so that the ground-truth token's probability is pushed toward 0.5 (spreading probability mass to alternatives) only on tokens where alternative valid paths likely exist
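
The flexible-versus-rigid test described above can be illustrated with a small sketch. This is not the authors' code: the function names, the toy distributions, and the fixed cutoff of 0.9 are assumptions chosen for illustration, and the paper may derive its threshold differently (for example from a masking ratio).

```python
def topk_cumulative_prob(probs, k):
    """Sum of the k largest probabilities in one next-token distribution."""
    return sum(sorted(probs, reverse=True)[:k])


def is_flexible(probs, k=2, threshold=0.9):
    """Return True when the top-k tokens do not absorb almost all the mass.

    `threshold` is a hypothetical cutoff used only in this sketch.
    """
    return topk_cumulative_prob(probs, k) < threshold


# Toy distributions over a 5-token vocabulary (made-up numbers).
rigid_step = [0.97, 0.01, 0.01, 0.005, 0.005]   # e.g. "is" after "The answer"
flexible_step = [0.40, 0.25, 0.20, 0.10, 0.05]  # e.g. start of a reasoning step

print(is_flexible(rigid_step))     # top-2 mass is about 0.98, so not flexible
print(is_flexible(flexible_step))  # top-2 mass is about 0.65, so flexible
```

Only steps classified as flexible would receive the diversity term; rigid steps are trained with Cross-Entropy alone.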

Architecture

Concept diagram of SED-SFT showing how the loss function treats different tokens differently

Evaluation Highlights

+2.06 points average improvement on Llama-3.2-3B-Instruct across 8 math benchmarks after RL compared to standard Cross-Entropy SFT
+1.20 points average improvement on Qwen2.5-Math-7B-Instruct across 8 math benchmarks after RL compared to standard Cross-Entropy SFT
Outperforms specialized SFT baselines (GEM and DFT) which failed to consistently improve over simple Cross-Entropy in the SFT-then-RL setting

Breakthrough Assessment

7/10

Offers a logically sound and effective fix for SFT mode collapse with consistent gains. While the scope is limited to math/reasoning SFT strategies, the selective masking idea is a valuable refinement over indiscriminate entropy regularization.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) of Large Language Models followed by Reinforcement Learning (RL) for mathematical reasoning

Inputs: Instruction prompts containing mathematical problems $x$ and ground truth solution sequences $y^*$

Outputs: Generated reasoning steps and final answers

Pipeline Flow

SFT Phase (SED-SFT training)
RL Phase (GRPO training)

System Modules

LLM Backbone

Generate mathematical reasoning steps and answers

Model or implementation: Llama-3.2-3B-Instruct or Qwen2.5-Math-7B-Instruct

Modeling

Base Model: Qwen2.5-Math-7B-Instruct and Llama-3.2-3B-Instruct

Training Method: SFT with Selective Diversity Encouragement followed by GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Minimize standard prediction error on ground truth tokens.

Formally: Cross-Entropy Loss L_CE

Purpose: Penalize high confidence on correct labels to encourage a more spread-out probability distribution (diversity), but only on tokens selected by the mask (M_t = 1).

Formally: L_DE(p) = 4(p - 0.5)^2, where p is the probability assigned to the correct label; minimized at p = 0.5

Purpose: Dynamically select which tokens receive the diversity penalty based on exploration space.

Formally: Mask M_t = 1 if Cumulative_TopK_Prob(t) < Threshold, else 0

Purpose: Combine objectives.

Formally: L_total = L_CE + lambda * M_t * L_DE
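
A minimal sketch of how the combined objective might be evaluated per token, written in plain Python so the arithmetic is easy to follow. The function names, the threshold value of 0.9, and the averaging over tokens are assumptions for illustration; an actual training implementation would operate on logits in an autodiff framework, and the authors' code may differ.

```python
import math


def sed_sft_token_loss(p_label, topk_cum_prob, threshold=0.9, lam=1.0):
    """Per-token loss L_CE + lam * M_t * L_DE.

    p_label       : model probability of the ground-truth token
    topk_cum_prob : cumulative probability of the top-k predicted tokens
    threshold     : hypothetical cutoff for the mask (assumption)
    lam           : weight of the diversity term (lambda)
    """
    l_ce = -math.log(p_label)                 # standard cross-entropy
    l_de = 4.0 * (p_label - 0.5) ** 2         # zero at p = 0.5, one at p = 0 or 1
    mask = 1.0 if topk_cum_prob < threshold else 0.0
    return l_ce + lam * mask * l_de


def sed_sft_sequence_loss(p_labels, topk_cum_probs, threshold=0.9, lam=1.0):
    """Mean per-token loss over a sequence (the averaging is an assumption)."""
    losses = [
        sed_sft_token_loss(p, c, threshold, lam)
        for p, c in zip(p_labels, topk_cum_probs)
    ]
    return sum(losses) / len(losses)


# Rigid token: mask is 0, so only cross-entropy contributes.
rigid = sed_sft_token_loss(p_label=0.99, topk_cum_prob=0.995)
# Flexible token: mask is 1, so the diversity term is added.
flexible = sed_sft_token_loss(p_label=0.90, topk_cum_prob=0.80)
print(rigid, flexible)
```

On the rigid token the loss reduces to plain Cross-Entropy, while on the flexible token the extra term grows as the label probability moves away from 0.5.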

Training Data:

SFT: 20,000 examples sampled from Micomind dataset
RL: Math (Level 1) training split, filtered to 2,069 prompts

Key Hyperparameters:

sft_learning_rate: 2e-5
rl_batch_size: 256
lambda: 1 (weight of diversity term)
k: 2 or 3 (for top-k metric)
masking_ratio_r: > 0.5 (typically)
rl_samples_per_prompt: 8
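
For readers reproducing the setup, the reported values could be collected in a single configuration object. The class name and field names below are hypothetical, and the defaults for `top_k` and `masking_ratio_r` are placeholders: the summary only states "2 or 3" and "> 0.5 (typically)", so exact values should be taken from the paper or repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SedSftConfig:
    """Hypothetical container for the hyperparameters listed above."""

    sft_learning_rate: float = 2e-5
    rl_batch_size: int = 256
    diversity_weight: float = 1.0   # lambda, weight of the diversity term
    top_k: int = 2                  # placeholder; summary reports 2 or 3
    masking_ratio_r: float = 0.6    # placeholder; summary says "> 0.5 (typically)"
    rl_samples_per_prompt: int = 8

    def __post_init__(self):
        # Basic sanity checks; the accepted ranges are assumptions.
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if not 0.0 < self.masking_ratio_r <= 1.0:
            raise ValueError("masking_ratio_r must lie in (0, 1]")


config = SedSftConfig(top_k=3)
print(config)
```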

Compute: 8 NVIDIA H20 GPUs

Comparison to Prior Work

vs. GEM: SED-SFT uses a selective mask to prevent encouraging diversity on rigid tokens (high cumulative prob), whereas GEM encourages it blindly, hurting accuracy
vs. DFT: DFT restricts gradients on high-confidence tokens to prevent overfitting, but SED-SFT actively pushes probability towards 0.5 on flexible tokens to expand exploration
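
The "pushes probability towards 0.5" behaviour can be read off the diversity term L_DE(p) = 4(p - 0.5)^2 listed under Objective Functions. The short derivation below treats the label probability p as a free scalar and assumes lambda = 1 on a selected token (M_t = 1); it ignores the softmax parameterization, so it is an illustration rather than a statement about the trained model.

```latex
\begin{aligned}
\mathcal{L}_{DE}(p) &= 4\,(p - 0.5)^2,
&\quad \frac{\partial \mathcal{L}_{DE}}{\partial p} &= 8\,(p - 0.5), \\
\mathcal{L}_{CE}(p) &= -\ln p,
&\quad \frac{\partial \mathcal{L}_{CE}}{\partial p} &= -\frac{1}{p}, \\
\frac{\partial}{\partial p}\bigl(\mathcal{L}_{CE} + \mathcal{L}_{DE}\bigr) = 0
&\;\Longrightarrow\; 8p^2 - 4p - 1 = 0
&\;\Longrightarrow\; p &= \frac{1 + \sqrt{3}}{4} \approx 0.683 .
\end{aligned}
```

The diversity gradient is positive for p > 0.5, so gradient descent lowers an over-confident label probability, while Cross-Entropy always pulls p upward. Under these assumptions the two balance at roughly 0.68 rather than at 1, which leaves probability mass for alternative tokens.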

Limitations

Relies on a heuristic (cumulative top-k probability) to define exploration space, which requires tuning k and r
Experiments focused solely on mathematical reasoning tasks; generalizability to creative writing or other domains is untested
Requires an SFT-then-RL pipeline; benefits for standalone SFT (without subsequent RL) are less emphasized
No statistical significance tests reported for the performance gains

Reproducibility

Code: https://github.com/pppa2019/SED-SFT

Code is publicly available. Datasets (Micomind, Math Level 1) are public. Standard architectures (Llama-3.2, Qwen2.5) are used. Hyperparameters for SFT and RL (GRPO) are provided.

📊 Experiments & Results

Evaluation Setup

Train with SFT, then with RL (GRPO), then evaluate on held-out math benchmarks

Benchmarks:

AIME24 (Mathematical Competition)
AIME25 (Mathematical Competition)
AMC23 (Mathematical Competition)
GSM8K (Grade School Math)
MATH500 (Advanced Math)
GAOKAO-en (College Entrance Exam Math)
OlympiadBench (Olympiad Math)
College-MATH (College Math)

Metrics:

Pass rate (Average across benchmarks)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance comparison of SED-SFT against standard Cross-Entropy (CE) and other SFT baselines after the subsequent RL phase.

Benchmark	Metric	Baseline	This Paper	Δ
Average (8 benchmarks), Llama-3.2-3B-Instruct, vs. CE	Pass Rate	47.90	49.96	+2.06
Average (8 benchmarks), Qwen2.5-Math-7B-Instruct, vs. CE	Pass Rate	62.00	63.20	+1.20
Average (8 benchmarks)	Pass Rate	47.28	49.96	+2.68
Average (8 benchmarks)	Pass Rate	46.33	49.96	+3.63
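
The Δ column can be checked directly from the table values. The snippet below recomputes each difference and also derives the relative gain over the baseline, which the summary does not report; the variable names are illustrative.

```python
# (baseline, this paper, reported delta) copied from the table rows.
rows = [
    (47.90, 49.96, 2.06),
    (62.00, 63.20, 1.20),
    (47.28, 49.96, 2.68),
    (46.33, 49.96, 3.63),
]

deltas = []
for baseline, ours, reported in rows:
    delta = round(ours - baseline, 2)          # absolute gain in points
    relative = 100.0 * (ours - baseline) / baseline  # gain relative to baseline, in %
    deltas.append(delta)
    print(f"{baseline:.2f} -> {ours:.2f}: +{delta:.2f} points ({relative:.1f}% relative)")
```

All four recomputed differences match the reported Δ values.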

Experiment Figures

Heatmap of label probability distribution across tokens in a sample sentence

Cumulative probability distributions for different k values across 100 math problems

Main Takeaways

SED-SFT consistently improves downstream RL performance compared to standard CE loss and other specialized SFT baselines (GEM, DFT)
Baselines like DFT performed well during the SFT phase (accuracy-wise) but restricted exploration so much that RL could not recover, leading to worse final performance
Analysis confirms SED-SFT increases sentence-level diversity (lower Self-BLEU) compared to CE and DFT

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT) standard practices
Reinforcement Learning (RL) for LLMs (specifically GRPO)
Cross-Entropy Loss
Entropy Regularization

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs

RL: Reinforcement Learning—training an agent (the LLM) to maximize a reward signal (e.g., correct answer)

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes the policy based on the relative performance of a group of outputs for the same prompt

Mode Collapse: A failure mode where the model converges to a narrow set of outputs, losing the diversity needed to explore different solution paths

Top-k: The set of k tokens with the highest predicted probabilities at a given decoding step

Exploration Space: The set of plausible alternative tokens or reasoning paths available at a specific generation step
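
To make the GRPO entry above concrete, here is a minimal sketch of the group-relative advantage: each sampled output's reward is normalized against the mean and standard deviation of its own group. The binary correctness reward, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; implementations differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Eight sampled solutions for one prompt, rewarded 1.0 if the answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

If every sample in a group receives the same reward, all advantages are zero and the prompt gives no learning signal, which is one reason a diverse SFT policy matters for the subsequent RL phase.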