Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng
University of Illinois Urbana-Champaign
arXiv (2026)

📝 Paper Summary

PEAR improves the transition from supervised fine-tuning to reinforcement learning by reweighting offline data based on how likely the target policy is to generate those sequences, correcting distribution mismatches.
Core Problem
Standard supervised fine-tuning (SFT) optimizes for offline accuracy in isolation, but models that perform well offline often fail to improve during subsequent reinforcement learning (RL) due to a distribution mismatch between the data-generating policy and the training policy.
Why it matters:
  • Gains in offline SFT accuracy frequently disappear or reverse after RL, making traditional SFT metrics misleading proxies for final performance
  • The 'behavior policy' (offline data) often contains reasoning paths that the 'target policy' (the model being trained) finds unlikely, causing the model to learn dead-end transitions that hurt online exploration
  • Current pipelines treat SFT and RL as separate stages, ignoring the crucial offline-to-online shift that dictates RL headroom
Concrete Example: In a logic puzzle, standard SFT treats all correct training traces equally. However, if the current model finds the first step of a specific trace highly improbable, forcing it to learn the subsequent steps creates a 'broken' reasoning path. Later, during RL, the model cannot effectively revisit or improve upon this path because the prefix is effectively unreachable under its own policy.
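The intuition above can be made concrete with a toy calculation (not code from the paper): the probability that the model reaches a given reasoning prefix on-policy is the product of its per-step probabilities, so a single improbable early step makes the whole trace effectively unreachable during RL exploration. The function name and numbers here are illustrative assumptions.

```python
def prefix_reachability(step_probs):
    """Chain per-step probabilities to estimate how likely the current
    policy is to reproduce an offline reasoning prefix on its own."""
    p = 1.0
    for q in step_probs:
        p *= q
    return p

# Trace whose first step the model finds highly improbable:
broken = prefix_reachability([0.001, 0.9, 0.9])   # ≈ 8.1e-4
# Trace the model can actually generate:
reachable = prefix_reachability([0.6, 0.7, 0.8])  # = 0.336
```

Standard SFT would push equally hard on both traces, even though RL can only ever revisit and refine the second one.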
Key Novelty
Policy Evaluation-inspired Algorithm for Offline Learning Loss Reweighting (PEAR)
  • View the transition from SFT to RL as an 'off-policy evaluation' problem where we must correct for the difference between the data source and the model's current behavior
  • Instead of treating all training tokens equally, down-weight tokens that lead to futures the current model considers unlikely, and up-weight paths the model can actually generate
  • Apply this reweighting (via importance sampling) directly to the SFT loss without changing the underlying training objective or requiring new data
Evaluation Highlights
  • +14.6% Pass@8 on AIME-2025 using Qwen3-1.7B-Base compared to standard SFT initialization
  • +40% absolute accuracy on synthetic logic games compared to standard SFT initialization after identical RL training
  • Consistent post-RL gains across 6 different models (including Qwen2.5-Math and DeepSeek-Distill) on hard math benchmarks like MATH-500 and AIME-2024
Breakthrough Assessment
8/10
Identifies a critical, overlooked flaw in the standard SFT-then-RL pipeline (offline-online mismatch) and provides a theoretically grounded, highly effective fix that works across model scales.