Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng
University of Illinois Urbana-Champaign
arXiv (2026)

📝 Paper Summary

PEAR improves the transition from supervised fine-tuning to reinforcement learning by reweighting offline data based on how likely the target policy is to generate those sequences, correcting distribution mismatches.
Core Problem
Standard supervised fine-tuning (SFT) optimizes for offline accuracy in isolation, but models that perform well offline often fail to improve during subsequent reinforcement learning (RL) due to a distribution mismatch between the data-generating policy and the training policy.
Why it matters:
  • Gains in offline SFT accuracy frequently disappear or reverse after RL, making traditional SFT metrics misleading proxies for final performance
  • The 'behavior policy' (offline data) often contains reasoning paths that the 'target policy' (the model being trained) finds unlikely, causing the model to learn dead-end transitions that hurt online exploration
  • Current pipelines treat SFT and RL as separate stages, ignoring the crucial offline-to-online shift that dictates RL headroom
Concrete Example: In a logic puzzle, standard SFT treats all correct training traces equally. However, if the current model finds the first step of a specific trace highly improbable, forcing it to learn the subsequent steps creates a 'broken' reasoning path. Later, during RL, the model cannot effectively revisit or improve upon this path because the prefix is effectively unreachable under its own policy.
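The intuition above can be made concrete with a toy calculation (not code from the paper): the probability that the model reaches a given reasoning prefix on-policy is the product of its per-step probabilities, so a single improbable early step makes the whole trace effectively unreachable during RL exploration. The function name and numbers here are illustrative assumptions.

```python
def prefix_reachability(step_probs):
    """Chain per-step probabilities to estimate how likely the current
    policy is to reproduce an offline reasoning prefix on its own."""
    p = 1.0
    for q in step_probs:
        p *= q
    return p

# Trace whose first step the model finds highly improbable:
broken = prefix_reachability([0.001, 0.9, 0.9])   # ≈ 8.1e-4
# Trace the model can actually generate:
reachable = prefix_reachability([0.6, 0.7, 0.8])  # = 0.336
```

Standard SFT would push equally hard on both traces, even though RL can only ever revisit and refine the second one.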
Key Novelty
Policy Evaluation-inspired Algorithm for Offline Learning Loss Reweighting (PEAR)
  • View the transition from SFT to RL as an 'off-policy evaluation' problem where we must correct for the difference between the data source and the model's current behavior
  • Instead of treating all training tokens equally, down-weight tokens that lead to futures the current model considers unlikely, and up-weight paths the model can actually generate
  • Apply this reweighting (via importance sampling) directly to the SFT loss without changing the underlying training objective or requiring new data
Evaluation Highlights
  • +14.6% Pass@8 on AIME-2025 using Qwen3-1.7B-Base compared to standard SFT initialization
  • +40% absolute accuracy on synthetic logic games compared to standard SFT initialization after identical RL training
  • Consistent post-RL gains across 6 different models (including Qwen2.5-Math and DeepSeek-Distill) on hard math benchmarks like MATH-500 and AIME-2024
Breakthrough Assessment
8/10
Identifies a critical, overlooked flaw in the standard SFT-then-RL pipeline (offline-online mismatch) and provides a theoretically grounded, highly effective fix that works across model scales.