
DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

Hanxu Hu, Yuxuan Wang, Maggie Huan, Jannis Vamvas, Yinya Huang, Zhijiang Guo, Rico Sennrich
University of Zurich, University of Pennsylvania
arXiv (2026)
Reasoning RL Benchmark

📝 Paper Summary

Reinforcement Learning for Reasoning Post-training Data Curriculum
DeReason improves general reasoning by partitioning data based on difficulty, using easy samples for Supervised Fine-Tuning to build knowledge and hard samples for Reinforcement Learning to refine complex reasoning.
Core Problem
Applying pure Reinforcement Learning (RL) directly to base models for general STEM reasoning is sample-inefficient and often underperforms simple Supervised Fine-Tuning (SFT) because models lack the necessary domain knowledge foundation.
Why it matters:
  • Current trends prioritize pure RL (like DeepSeek-R1-Zero) for reasoning, but this often fails in general scientific domains where broad knowledge is a prerequisite.
  • Blindly mixing easy and hard data for both SFT and RL is inefficient; easy data doesn't benefit from RL's costly exploration, while hard data is wasted in SFT if the teacher's reasoning is imperfect.
  • Acquiring domain knowledge (physics formulae, facts) is hard through trial-and-error RL, making SFT a critical but often misallocated component in modern post-training pipelines.
Concrete Example: In general STEM tasks, a base model trained with pure RL on physics problems struggles to discover correct formulae from scratch. Conversely, SFT on complex multi-step derivations often leads to rote memorization of the teacher's specific path rather than true reasoning generalization.
Key Novelty
DeReason (Difficulty-based Decoupling)
  • Use an LLM to score problem difficulty (1-5); low-difficulty problems (knowledge recall) are routed to SFT to efficiently distill domain knowledge.
  • High-difficulty problems (reasoning-intensive) are reserved for RLVR, where the model initializes from the SFT checkpoint and explores reasoning paths beyond the teacher's demonstrations.
  • Decouples the 'knowledge acquisition' phase (best done via SFT) from the 'reasoning refinement' phase (best done via RL) based on data characteristics rather than just training stages.
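The routing described in the bullets above can be sketched as follows. Note that `score_difficulty`, the threshold of 3, and the word-count heuristic are illustrative assumptions for this sketch only; the paper uses an LLM judge to assign the 1-5 difficulty scores.

```python
# Minimal sketch of difficulty-based data decoupling (DeReason-style).
# The scoring heuristic and threshold here are placeholders, not the paper's method.

def score_difficulty(problem: str) -> int:
    """Stand-in for an LLM judge rating difficulty on a 1-5 scale.
    Toy heuristic: longer problems score higher, capped at 5."""
    return min(5, 1 + len(problem.split()) // 20)

def partition(dataset, threshold=3):
    """Route low-difficulty items (knowledge recall) to the SFT pool and
    high-difficulty items (reasoning-intensive) to the RL pool."""
    sft_pool, rl_pool = [], []
    for item in dataset:
        if score_difficulty(item["problem"]) < threshold:
            sft_pool.append(item)   # distill domain knowledge via SFT
        else:
            rl_pool.append(item)    # explore reasoning paths via RLVR
    return sft_pool, rl_pool

# Pipeline order: first SFT on sft_pool, then RLVR on rl_pool,
# with RL initialized from the SFT checkpoint.
```

The key design choice is that the split is decided per example by difficulty, not by randomly halving the data across the two training stages.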
Evaluation Highlights
  • SFT on moderate-quality data consistently outperforms pure RLVR on base models across math and STEM benchmarks (e.g., GPQA-Diamond), challenging the 'RL is all you need' narrative.
  • DeReason curriculum (SFT on easy, RL on hard) outperforms pure SFT, pure RL, and random-split SFT-then-RL baselines on Qwen3-4B-Base.
  • On challenging benchmarks like BBEH (reasoning-focused), the decoupled pipeline yields clear improvements over SFT-only baselines, while gaps are smaller on knowledge-heavy tasks like MMLU-Pro.
Breakthrough Assessment
7/10
Provides a pragmatic, empirically grounded recipe for combining SFT and RL. While not algorithmically novel, the systematic analysis of data allocation based on difficulty offers a valuable engineering insight for post-training.