
Rethinking Expert Trajectory Utilization in LLM Post-training

Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin
Zhejiang University, Westlake University, Huawei Noah’s Ark Lab
arXiv (2025)
Reasoning RL Benchmark

📝 Paper Summary

LLM Post-training · Mathematical Reasoning
The sequential SFT-then-RL pipeline outperforms synchronized approaches: large-scale SFT establishes a strong performance foundation, which in turn maximizes the plasticity left for the subsequent Reinforcement Learning phase.
Core Problem
It is unclear how to best utilize expert trajectories (SFT data) to maximize reasoning performance: the evidence conflicts between the Synchronized SFT-RL paradigm (mixing an imitation loss into RL) and the Sequential SFT-then-RL paradigm.
Why it matters:
  • Synchronized methods claim efficiency gains but have mostly been validated on limited data (~46K samples), raising doubts about their robustness at scale
  • Practitioners rely on Sequential SFT-then-RL empirically without rigorous guidelines on the optimal timing for switching phases
  • The 'Less is More' data hypothesis suggests minimal SFT data is sufficient, but it is unknown if this limits the model's potential for subsequent RL scaling
Concrete Example: Synchronized methods like SRFT integrate imitation loss directly into the RL loop to boost efficiency. However, when scaled to large datasets (889K samples), these methods often exhibit instability or lower performance ceilings compared to simply fine-tuning on the data first and then running RL.
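The synchronized paradigm described above can be sketched as a single objective that blends an imitation (SFT) loss into the RL loss at every update. This is an illustrative sketch only: the function name and the mixing weight `lambda_imit` are assumptions for exposition, not SRFT's actual formulation.

```python
# Illustrative sketch of a synchronized SFT+RL objective (NOT SRFT's exact
# loss): each optimization step combines an RL policy loss with an imitation
# loss on expert trajectories. `lambda_imit` is a hypothetical mixing weight.

def synchronized_loss(rl_loss: float, imitation_loss: float,
                      lambda_imit: float = 0.5) -> float:
    """Combined objective: L = L_RL + lambda * L_imitation."""
    return rl_loss + lambda_imit * imitation_loss

# The sequential paradigm instead optimizes the two terms in separate phases:
# first pure imitation (SFT) on the full dataset, then pure RL on the
# fine-tuned model, which is the setup the paper argues scales better.
```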
Key Novelty
Plasticity-Ceiling Framework
  • Decomposes final performance into two measurable components: the realized SFT Performance (foundation) and the remaining RL Plasticity (potential for further growth)
  • Demonstrates that a robust SFT phase is necessary to maximize the starting foundation, which contradicts 'Less is More' by showing that more SFT data increases the final ceiling
  • Identifies 'mild overfitting' in SFT as the optimal signal to switch to RL, ensuring the foundation is maximized without destroying plasticity
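The 'mild overfitting' switch signal can be sketched as a simple heuristic over the SFT loss curves: switch to RL once validation loss has started rising while training loss is still falling. The function, `patience`, and `tolerance` below are illustrative assumptions, not the paper's actual criterion.

```python
# Hypothetical sketch of a "mild overfitting" SFT-to-RL switch heuristic.
# Thresholds (`patience`, `tolerance`) are illustrative, not from the paper.

def should_switch_to_rl(train_losses, val_losses,
                        patience: int = 2, tolerance: float = 0.0) -> bool:
    """Signal the SFT->RL switch once validation loss has risen for
    `patience` consecutive epochs while training loss keeps falling,
    i.e. the model has just begun to overfit the expert trajectories."""
    if len(val_losses) <= patience:
        return False
    val_rising = all(val_losses[-i] > val_losses[-i - 1] + tolerance
                     for i in range(1, patience + 1))
    train_falling = train_losses[-1] < train_losses[-patience - 1]
    return val_rising and train_falling

# Example: validation loss ticks up for two epochs while training loss
# keeps dropping -> the heuristic signals the switch.
print(should_switch_to_rl([1.0, 0.8, 0.6, 0.5],
                          [0.9, 0.7, 0.72, 0.75]))  # → True
```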
Evaluation Highlights
  • Benchmarked on 6 mathematical datasets (including GSM8K, MATH, and OlympiadBench) using Qwen2.5-7B and Llama3.2-3B
  • Constructed and evaluated on a large-scale SFT dataset of 889K distilled DeepSeek trajectories to test scaling limits
  • Refutes the 'Less is More' hypothesis for the final ceiling, showing that SFT data scale determines primary potential while difficulty acts as a multiplier
Breakthrough Assessment
8/10
Provides a theoretical framework and rigorous empirical scaling laws for the SFT-then-RL pipeline, resolving a major industry debate about post-training paradigms and data efficiency.