
DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

Hanxu Hu, Yuxuan Wang, Maggie Huan, Jannis Vamvas, Yinya Huang, Zhijiang Guo, Rico Sennrich
University of Zurich, University of Pennsylvania
arXiv (2026)
Reasoning RL Benchmark

📝 Paper Summary

Reinforcement Learning for Reasoning Post-training Data Curriculum
DeReason improves general reasoning by partitioning data based on difficulty, using easy samples for Supervised Fine-Tuning to build knowledge and hard samples for Reinforcement Learning to refine complex reasoning.
Core Problem
Applying pure Reinforcement Learning (RL) directly to base models for general STEM reasoning is sample-inefficient and often underperforms simple Supervised Fine-Tuning (SFT) because models lack the necessary domain knowledge foundation.
Why it matters:
  • Current trends prioritize pure RL (like DeepSeek-R1-Zero) for reasoning, but this often fails in general scientific domains where broad knowledge is a prerequisite.
  • Blindly mixing easy and hard data for both SFT and RL is inefficient; easy data doesn't benefit from RL's costly exploration, while hard data is wasted in SFT if the teacher's reasoning is imperfect.
  • Acquiring domain knowledge (physics formulae, facts) is hard through trial-and-error RL, making SFT a critical but often misallocated component in modern post-training pipelines.
Concrete Example: In general STEM tasks, a base model trained with pure RL on physics problems struggles to discover correct formulae from scratch. Conversely, SFT on complex multi-step derivations often leads to rote memorization of the teacher's specific path rather than true reasoning generalization.
Key Novelty
DeReason (Difficulty-based Decoupling)
  • Use an LLM to score problem difficulty (1-5); low-difficulty problems (knowledge recall) are routed to SFT to efficiently distill domain knowledge.
  • High-difficulty problems (reasoning-intensive) are reserved for RLVR, where the model initializes from the SFT checkpoint and explores reasoning paths beyond the teacher's demonstrations.
  • Decouples the 'knowledge acquisition' phase (best done via SFT) from the 'reasoning refinement' phase (best done via RL) based on data characteristics rather than just training stages.
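The routing described in the bullets above can be sketched as follows. Note that `score_difficulty`, the threshold of 3, and the word-count heuristic are illustrative assumptions for this sketch only; the paper uses an LLM judge to assign the 1-5 difficulty scores.

```python
# Minimal sketch of difficulty-based data decoupling (DeReason-style).
# The scoring heuristic and threshold here are placeholders, not the paper's method.

def score_difficulty(problem: str) -> int:
    """Stand-in for an LLM judge rating difficulty on a 1-5 scale.
    Toy heuristic: longer problems score higher, capped at 5."""
    return min(5, 1 + len(problem.split()) // 20)

def partition(dataset, threshold=3):
    """Route low-difficulty items (knowledge recall) to the SFT pool and
    high-difficulty items (reasoning-intensive) to the RL pool."""
    sft_pool, rl_pool = [], []
    for item in dataset:
        if score_difficulty(item["problem"]) < threshold:
            sft_pool.append(item)   # distill domain knowledge via SFT
        else:
            rl_pool.append(item)    # explore reasoning paths via RLVR
    return sft_pool, rl_pool

# Pipeline order: first SFT on sft_pool, then RLVR on rl_pool,
# with RL initialized from the SFT checkpoint.
```

The key design choice is that the split is decided per example by difficulty, not by randomly halving the data across the two training stages.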
Evaluation Highlights
  • SFT on moderate-quality data consistently outperforms pure RLVR on base models across math and STEM benchmarks (e.g., GPQA-Diamond), challenging the 'RL is all you need' narrative.
  • DeReason curriculum (SFT on easy, RL on hard) outperforms pure SFT, pure RL, and random-split SFT-then-RL baselines on Qwen3-4B-Base.
  • On challenging benchmarks like BBEH (reasoning-focused), the decoupled pipeline yields clear improvements over SFT-only baselines, while gaps are smaller on knowledge-heavy tasks like MMLU-Pro.
Breakthrough Assessment
7/10
Provides a pragmatic, empirically grounded recipe for combining SFT and RL. While not algorithmically novel, the systematic analysis of data allocation based on difficulty offers a valuable engineering insight for post-training.