Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

📝 Paper Summary

LLM Post-training, Offline Reinforcement Learning

SFT on filtered data mathematically optimizes a lower bound on the sparse-reward RL objective, and a simple importance-weighting modification (iw-SFT) tightens this bound to improve performance.

Core Problem

SFT on curated data is effective but is usually treated as distinct from RL. In fact it optimizes a lower bound on the RL objective, and that bound loosens as the model drifts from the reference distribution.

Why it matters:

RL is notoriously difficult to tune and computationally expensive compared to SFT
Understanding SFT as a specific case of RL allows for theoretical improvements that bridge the gap between simple SFT stability and RL performance
Standard SFT cannot effectively incorporate information from failures or learn optimal policies when the reference data is suboptimal (e.g., multimodal distributions with bad modes)

Concrete Example: In a toy bandit problem where an optimal action is 'pull-right' but the reference data is 50/50 'pull-left'/'pull-right', standard SFT on successful trials results in a suboptimal policy (33% left / 66% right). iw-SFT, by reweighting successful trajectories, recovers the optimal policy (100% right).
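The bandit example can be reproduced in closed form with a minimal sketch. The summary does not state the success probabilities, so the values below are an assumption chosen only because they give the 1/3 vs 2/3 split quoted above ('pull-left' succeeds half the time, 'pull-right' always). The helper name `weighted_mle` is hypothetical, and the repeated re-weighting loop is one plausible reading of how iw-SFT iterates, not the paper's exact procedure.

```python
# Reference policy from the example: 50/50 between the two arms.
REF = {"left": 0.5, "right": 0.5}
# ASSUMPTION: per-arm success probabilities (not given in the summary).
SUCCESS = {"left": 0.5, "right": 1.0}


def weighted_mle(ref, q, success):
    """Closed-form maximizer of sum_a ref[a] * w[a] * success[a] * log p[a],
    with importance weight w[a] = q[a] / ref[a]. The maximizer of a weighted
    log-likelihood over a categorical is the normalized weight vector."""
    scores = {a: ref[a] * (q[a] / ref[a]) * success[a] for a in ref}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}


# Plain SFT on filtered data: q equals the reference, so every weight is 1.
sft_policy = weighted_mle(REF, REF, SUCCESS)

# Importance-weighted variant: reuse the previous policy as q and repeat.
iw_policy = dict(REF)
for _ in range(20):
    iw_policy = weighted_mle(REF, iw_policy, SUCCESS)

print("SFT:   ", sft_policy)
print("iw-SFT:", iw_policy)
```

Under these assumed numbers, plain SFT stays at one third 'pull-left', while each re-weighting round halves the left-to-right probability ratio, so the iterated policy concentrates on 'pull-right'.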

Key Novelty

Importance Weighted Supervised Fine-Tuning (iw-SFT)

Reframes SFT on filtered data as maximizing a lower bound on the RL objective in sparse reward settings
Introduces an importance-weighting term to the SFT loss that tightens the lower bound as the model trains, effectively 'adaptive-filtering' data based on the model's current probability
Demonstrates that this simple modification allows SFT to approach RL performance without complex RL machinery (like value functions or PPO clipping)

Architecture

The iterative training process of iw-SFT.

Evaluation Highlights

Achieves 66.7% accuracy on AIME 2024 (reasoning benchmark) using iw-SFT, outperforming standard SFT on the same curated data
Achieves 64.1% on GPQA, surpassing standard SFT baselines
Outperforms state-of-the-art offline RL baselines (IQL, AWAC) on D4RL continuous control tasks (MuJoCo) using the same importance-weighted logic

Breakthrough Assessment

8/10

Provides a strong theoretical unification of SFT and RL for LLMs. The method is extremely simple to implement (just reweighting loss) yet yields significant gains on difficult reasoning benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Policy optimization in sparse reward settings (binary success/failure) using offline datasets generated by a reference policy

Inputs: A dataset of trajectories (token sequences or state-action pairs) filtered for success or quality

Outputs: An optimized policy (LLM or control policy) that maximizes expected return

Pipeline Flow

Reference Policy Data Generation / Collection
Filtering / Curation
Importance Weight Calculation
Weighted Maximum Likelihood Training

System Modules

Reference Policy / Data Source (Data Preparation)

Generates initial trajectories (or provided as a static dataset)

Model or implementation: Pre-trained LLM or Human Demonstrations

Filter / Curator (Data Preparation)

Selects successful trajectories based on sparse rewards (e.g., correct answer)

Model or implementation: Deterministic check or Quality Scorer

Importance Weighter (Training Loop)

Calculates weight w = q(tau) / pi_ref(tau) for each trajectory in the batch, where q is the time-lagged copy of the current policy and pi_ref is the reference policy

Model or implementation: Current Policy and Reference Policy

Policy Optimizer (Training Loop)

Updates model parameters using weighted NLL loss

Model or implementation: Target LLM or Control Policy
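The two training-loop modules can be sketched together in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the dictionary keys (`policy`, `lagged`, `ref`) and the function names are hypothetical, per-token log-probabilities are assumed to be available for all three models, and the weight is treated as a constant (no gradient would flow through it in a real autodiff framework).

```python
import math


def trajectory_logprob(token_logprobs):
    """Log-probability of a whole trajectory = sum of its per-token log-probs."""
    return sum(token_logprobs)


def importance_weight(lagged_token_logprobs, ref_token_logprobs):
    """w = q(tau) / pi_ref(tau), computed in log space for numerical stability."""
    log_w = trajectory_logprob(lagged_token_logprobs) - trajectory_logprob(ref_token_logprobs)
    return math.exp(log_w)


def iw_sft_loss(batch):
    """Importance-weighted negative log-likelihood, averaged over the batch.
    Each item holds per-token log-probs under the trained policy, the lagged
    copy used for weighting, and the reference policy."""
    total = 0.0
    for traj in batch:
        w = importance_weight(traj["lagged"], traj["ref"])  # treated as a constant
        total += -w * trajectory_logprob(traj["policy"])
    return total / len(batch)


# Two made-up successful trajectories (values are illustrative only).
batch = [
    {"policy": [-0.1, -0.2], "lagged": [-0.1, -0.2], "ref": [-0.5, -0.6]},
    {"policy": [-1.0, -1.5], "lagged": [-1.0, -1.5], "ref": [-0.5, -0.6]},
]
weights = [importance_weight(t["lagged"], t["ref"]) for t in batch]
loss = iw_sft_loss(batch)
print(weights, loss)
```

The first trajectory is more likely under the lagged policy than under the reference, so its weight exceeds 1; the second is less likely, so its weight falls below 1.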

Novel Architectural Elements

Integration of an adaptive importance weighting mechanism directly into the standard SFT loop
Use of a time-lagged copy of the policy to estimate importance weights for stability
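One simple way to keep a time-lagged copy is a periodic hard refresh, sketched below. The summary does not say how the lag is implemented, so the refresh schedule, the `LaggedCopy` class, and its `refresh_every` parameter are all assumptions made for illustration; an exponential moving average would be an equally plausible choice.

```python
import copy


class LaggedCopy:
    """Holds a snapshot of the parameters that is refreshed every N steps."""

    def __init__(self, params, refresh_every):
        self.refresh_every = refresh_every  # hypothetical knob, not from the paper
        self.params = copy.deepcopy(params)
        self.steps = 0

    def step(self, live_params):
        """Call once per optimizer step; copies the live parameters on schedule."""
        self.steps += 1
        if self.steps % self.refresh_every == 0:
            self.params = copy.deepcopy(live_params)


# Toy usage: a single scalar "parameter" that grows by 1 each step.
live = {"w": 0.0}
lagged = LaggedCopy(live, refresh_every=3)
history = []
for _ in range(6):
    live["w"] += 1.0          # stand-in for a gradient update
    lagged.step(live)
    history.append(lagged.params["w"])
print(history)
```

Between refreshes the weights are computed against a fixed snapshot, which keeps the weighting target from moving on every gradient step.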

Modeling

Base Model: Varies by experiment (e.g., LLMs for reasoning, MLPs for control)

Training Method: Importance Weighted Supervised Fine-Tuning (iw-SFT)

Objective Functions:

Purpose: Maximize RL objective via a tighter lower bound.

Formally: J_iw-SFT(theta) = sum_i [ w_i * log p(tau_i | theta) ], where the sum runs over the curated trajectories tau_i and w_i is the importance-sampling ratio for trajectory tau_i.
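The bound behind this objective can be sketched with the elementary inequality x >= 1 + log x. The derivation below is a reconstruction under stated assumptions rather than a quotation from the paper: rewards R(tau) are nonnegative (binary in the sparse setting), pi_ref is the reference policy that generated the data, and q is an auxiliary distribution with the same support, so that w = q(tau) / pi_ref(tau).

```latex
\begin{align*}
J(\theta)
  &= \mathbb{E}_{\tau \sim p_\theta}\bigl[R(\tau)\bigr]
   = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
       \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)} \cdot
       \frac{p_\theta(\tau)}{q(\tau)} \, R(\tau)\right] \\
  &\ge \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
       \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)} \, R(\tau)
       \bigl(1 + \log p_\theta(\tau) - \log q(\tau)\bigr)\right] \\
  &= \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
       \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)} \, R(\tau) \,
       \log p_\theta(\tau)\right] + \mathrm{const.}
\end{align*}
```

Choosing q = pi_ref makes every weight equal to 1 and recovers plain SFT on filtered data, since binary R keeps only the successful trajectories. The inequality holds with equality when q = p_theta, which is why a q that tracks the current policy gives a tighter bound.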

Key Hyperparameters:

clipping_alpha: 1.0 +/- 0.8 (for LLMs)
clipping_beta: Not explicitly reported in the paper text, implied as bound
temperature_k: Varies (alpha parameter used for smoothing in control tasks)
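Weight clipping can be sketched as follows. Reading "1.0 +/- 0.8" as the closed interval [0.2, 1.8] is an assumption, and the function name and parameters are hypothetical; the paper may clip in log space or apply the bound differently.

```python
def clip_weight(w, center=1.0, half_width=0.8):
    """Clamp an importance weight to [center - half_width, center + half_width].
    ASSUMPTION: the reported '1.0 +/- 0.8' is interpreted as this interval."""
    low, high = center - half_width, center + half_width
    return max(low, min(high, w))


raw_weights = [0.05, 0.9, 2.2255, 40.0]  # illustrative values only
clipped = [clip_weight(w) for w in raw_weights]
print(clipped)
```

Clipping caps the influence of any single trajectory on the loss, trading some bias for the lower variance mentioned under Limitations.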

Compute: Requires keeping a reference model in memory (or computing probabilities beforehand) to calculate importance weights

Comparison to Prior Work

vs. SFT: Adds importance weighting to tighten the RL bound, allowing the model to recover from suboptimal reference distributions
vs. ReST: Explicitly uses importance weights q(tau)/pi_ref(tau) rather than just filtering, optimizing a tighter bound
vs. PPO: Does not require a separate value function or critic network; operates offline on curated data

Limitations

Requires a mechanism to calculate probability of trajectories under the reference policy (easy for LLMs, harder for some black-box settings)
Importance weights can have high variance, requiring clipping or smoothing heuristics
Performance depends on the quality and coverage of the initial curated dataset

Reproducibility

The paper provides the mathematical derivation and the core algorithm logic (Algorithm 1 in appendix). Specific code URL is not provided. The datasets (AIME 2024, MATH500, D4RL) are standard open benchmarks.

📊 Experiments & Results

Evaluation Setup

LLM Reasoning and Offline Reinforcement Learning for Control

Benchmarks:

AIME 2024 (Mathematical Reasoning)
GPQA (Graduate-Level Reasoning QA)
MATH500 (Mathematical Problem Solving)
D4RL (MuJoCo) (Continuous Control, Locomotion)

Metrics:

Accuracy (pass@1)
Normalized Score (for D4RL)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
AIME 2024	Accuracy	Not reported in the paper	66.7%	Not reported in the paper
GPQA	Accuracy	Not reported in the paper	64.1%	Not reported in the paper
D4RL	Performance	Not reported in the paper	Not reported in the paper	Not reported in the paper

Control task results demonstrating iw-SFT effectiveness against offline RL baselines.

Experiment Figures

A toy bandit experiment comparing SFT and iw-SFT policies against the optimal policy.

Main Takeaways

iw-SFT consistently improves over standard SFT/BC by assigning higher weights to trajectories where the current policy assigns higher probability than the reference, effectively 'self-reinforcing' good behaviors.
The method is generalizable across domains, working for both discrete token generation (LLMs) and continuous control (MuJoCo).
Reasoning capabilities emerge naturally from iw-SFT on curated data without needing explicit inference-time scaling techniques like budget forcing.
The theoretical connection shows that SFT on curated data optimizes a loose lower bound on the RL objective and that iw-SFT tightens this bound, which explains both why SFT works well and how it can be improved.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (policy, return, objective functions)
Supervised Fine-Tuning (SFT) / Behavior Cloning (BC)
Importance Sampling
Expectation-Maximization (EM) algorithm

Key Terms

SFT: Supervised Fine-Tuning—training a model to maximize the likelihood of the next token in a provided dataset

iw-SFT: Importance Weighted SFT—the proposed method which weights SFT examples based on the ratio of the current policy's probability to the reference policy's probability

SFT(Q): SFT from quality sampled data—a variant where data is sampled proportional to quality scores (e.g., star ratings)

RL: Reinforcement Learning—training an agent to maximize cumulative rewards through trial and error

BC: Behavior Cloning—a form of imitation learning where a policy is trained to mimic an expert's actions (equivalent to SFT)

RWR: Reward Weighted Regression—an RL algorithm that weights training examples by their rewards

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution

sparse reward: A setting where the agent receives non-zero feedback only rarely, often just binary success/failure at the end of a task

PPO: Proximal Policy Optimization—a popular RL algorithm that uses a clipped objective to ensure stable updates

DPO: Direct Preference Optimization—a method to align language models to preferences without explicit reward modeling

IQL: Implicit Q-Learning—an offline RL algorithm

AWAC: Advantage Weighted Actor Critic—an offline RL algorithm