Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

📝 Paper Summary

LLM Reasoning, Reinforcement Learning for LLMs, Post-training methodologies

BRIDGE unifies SFT and RL into a single cooperative bilevel optimization process, where SFT meta-learns to guide RL, preventing catastrophic forgetting and improving exploration.

Core Problem

The standard two-stage 'cold start' pipeline (SFT followed by RL) suffers from catastrophic forgetting, where the model loses SFT-acquired behaviors during RL, and inefficient exploration due to lack of guidance.

Why it matters:

Current reasoning models like OpenAI o1 rely on large-scale RL, but RL's trial-and-error exploration is highly inefficient without guidance.
The decoupled nature of two-stage training causes a 'dip-then-rise' performance trajectory, wasting compute and reducing data efficiency.
SFT alone generalizes poorly, while RL alone is slow to converge; existing methods fail to synergize their complementary strengths.

Concrete Example: In cold-start training, response lengths initially drop sharply during the RL stage before recovering (a U-shaped trajectory), indicating the model forgets expert reasoning patterns from SFT before painfully relearning them through trial and error.

Key Novelty

BRIDGE (Bilevel Reinforcement and Imitation for Diverse Generation and Exploration)

Formulates training as a bilevel optimization game where SFT is the upper-level 'teacher' and RL is the lower-level 'student'.
Uses an augmented architecture: a base model optimized by RL and a LoRA module optimized by SFT to maximize the 'cooperative gain' (improvement over RL alone).
Enables bidirectional information flow: SFT sees the RL solution and updates parameters to guide the next RL step, rather than just providing a static initialization.

Architecture

The bilevel optimization framework of BRIDGE. It illustrates the interaction between the Upper-Level (SFT) and Lower-Level (RL) objectives.

Evaluation Highlights

Reduces training time by 44% while improving performance by 13% on Qwen2.5-3B, relative to baselines.
Reduces training time by 14% while improving performance by 10% on Qwen3-8B, relative to baselines.
Consistently outperforms SFT, RL-zero, cold-start, and alternating baselines across five math benchmarks (including MATH and OlympiadBench).

Breakthrough Assessment

8/10

Offers a mathematically grounded solution to the well-known 'alignment tax' or forgetting problem in RLHF/RLVR pipelines. The bilevel formulation is a significant conceptual advance over simple multi-task learning.

⚙️ Technical Details

Problem Definition

Setting: Optimization of a language model policy to maximize both likelihood of expert traces (SFT) and expected reward (RL)

Inputs: Input prompt x

Outputs: Reasoning trace r and final answer y

Pipeline Flow

Lower-level: RL Update (Gradient Fusion)
Upper-level: SFT Update (Maximizing Cooperative Gain)

System Modules

Base Model

Main policy network generating reasoning traces and answers

Model or implementation: Qwen2.5-3B, Llama-3.2-3B-Instruct, or Qwen3-8B-Base

LoRA Module

Auxiliary parameters providing guidance to the base model

Model or implementation: LoRA adapters attached to Base Model

Novel Architectural Elements

Augmented model architecture splitting parameters into Base (RL-optimized) and LoRA (SFT-optimized) components to enable bilevel co-adaptation
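
The parameter split described above can be illustrated with a minimal sketch of a single linear layer. This is not the paper's implementation: the class name `AugmentedLinear`, the helper `matmul`, and the `parameter_groups` method are hypothetical, and plain Python lists stand in for tensors. It only shows the idea that the effective weight is the base weight plus a low-rank LoRA product, with the two parts assigned to different optimization objectives.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


class AugmentedLinear:
    """Hypothetical layer whose effective weight is base + lora_b @ lora_a."""

    def __init__(self, base: Matrix, lora_b: Matrix, lora_a: Matrix) -> None:
        self.base = base      # d_out x d_in, assumed updated by the RL (lower-level) step
        self.lora_b = lora_b  # d_out x r, assumed updated by the SFT (upper-level) step
        self.lora_a = lora_a  # r x d_in, assumed updated by the SFT (upper-level) step

    def effective_weight(self) -> Matrix:
        # Low-rank update added element-wise to the base weight.
        delta = matmul(self.lora_b, self.lora_a)
        return [
            [w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.base, delta)
        ]

    def forward(self, x: List[float]) -> List[float]:
        return [sum(w * v for w, v in zip(row, x)) for row in self.effective_weight()]

    def parameter_groups(self) -> Dict[str, List[Matrix]]:
        # Disjoint groups, so each objective can own its own optimizer.
        return {"rl": [self.base], "sft": [self.lora_a, self.lora_b]}


layer = AugmentedLinear(
    base=[[1.0, 0.0], [0.0, 1.0]],
    lora_b=[[1.0], [0.0]],
    lora_a=[[0.0, 2.0]],
)
output = layer.forward([1.0, 1.0])
```

Keeping the two groups disjoint is what lets an RL optimizer and an SFT optimizer update the same forward pass without overwriting each other's parameters.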

Modeling

Base Model: Qwen2.5-3B, Llama-3.2-3B-Instruct, Qwen3-8B-Base

Training Method: BRIDGE (Bilevel Optimization with Penalty-based Relaxation)

Objective Functions:

Purpose: Maximize log-likelihood of expert demonstrations.
Formally: J_SFT(θ) = E[(x,r,y)~D_SFT] [log π(r,y|x; θ)]

Purpose: Maximize expected reward from verifiable outcomes.
Formally: J_RL(θ) = E[(x,y)~D_RL, y_hat~π] [R(y_hat, y)]

Purpose: Enforce bilevel constraint via penalty.
Formally: L(θ, w) = J_RL(θ, w) - λ * ||θ - θ*(w)||^2 (simplified)

Purpose: Upper-level objective maximizing cooperative gain.
Formally: Maximize J_SFT(θ*(w)) + [J_RL(θ*(w)) - J_RL(θ_hat)]
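
Read together, the objectives above suggest the following bilevel problem. This is a reconstruction from the summary's notation, not a formula quoted from the paper: the argmax form of the lower-level constraint is an assumption inferred from the symbol θ*(w), and θ_hat is read as the solution obtained by RL alone, in line with the description of cooperative gain as improvement over RL alone.

```latex
\begin{aligned}
\max_{w}\quad & J_{\mathrm{SFT}}\big(\theta^{*}(w)\big)
  + \Big[ J_{\mathrm{RL}}\big(\theta^{*}(w)\big) - J_{\mathrm{RL}}\big(\hat{\theta}\big) \Big] \\
\text{s.t.}\quad & \theta^{*}(w) \in \arg\max_{\theta}\; J_{\mathrm{RL}}(\theta, w)
\end{aligned}
```

Here w denotes the upper-level (LoRA) parameters and θ the lower-level (base model) parameters; the bracketed term is the cooperative gain.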

Adaptation: LoRA used for the upper-level guidance parameters

Key Hyperparameters:

penalty_weight_lambda: Annealed from 0 to 1
RL_algorithm: PPO/GRPO, as implemented in the Verl framework
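
The summary states only that the penalty weight is annealed from 0 to 1; the shape of the schedule is not given. The sketch below assumes a linear ramp over a fixed number of steps, and the function name `penalty_weight` and its parameters are hypothetical.

```python
def penalty_weight(step: int, total_steps: int,
                   start: float = 0.0, end: float = 1.0) -> float:
    """Linearly anneal the penalty weight from `start` to `end`.

    Assumption: a linear schedule. The source only gives the endpoints (0 and 1).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so the weight stays at `end` after the ramp.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


schedule = [penalty_weight(s, 100) for s in (0, 50, 100, 200)]
```

Starting at 0 leaves the lower-level RL objective unconstrained early on, and the constraint tightens as the weight approaches 1.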

Compute: Requires 14-44% less wall-clock training time than baselines, depending on the model

Comparison to Prior Work

vs. Cold Start: BRIDGE prevents the 'U-shaped' performance dip and catastrophic forgetting by maintaining SFT guidance throughout.
vs. Alternating SFT-RL: BRIDGE explicitly maximizes 'cooperative gain' (SFT helps RL) via bilevel optimization, whereas alternating just switches updates independently.
vs. RL-zero: BRIDGE converges significantly faster due to guided exploration from the SFT objective.
vs. MAML [not cited in paper]: BRIDGE splits parameters (Base vs. LoRA) to prevent the lower-level solution from collapsing to a single gradient step, ensuring true RL optimization occurs.

Limitations

Penalty-based relaxation is an approximation of the true bilevel problem.
Requires carefully tuning the penalty weight annealing schedule.
Currently validated primarily on math reasoning benchmarks; generalization to other reasoning tasks (e.g., coding, commonsense) is less explored in the main results.

Reproducibility

SFT data from DeepMath-103k; RL data from LIMR and MATH. RL training uses the Verl framework. Code availability is not explicitly provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with verifiable binary rewards (correct/incorrect).
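
A verifiable binary reward of this kind can be sketched as a rule-based comparison of final answers. The version below is a deliberate simplification: it compares normalized strings, whereas practical math verifiers usually also check symbolic or numeric equivalence. The function names `normalize_answer` and `binary_reward` are hypothetical and do not come from the paper or from Verl.

```python
def normalize_answer(text: str) -> str:
    """Strip whitespace, a trailing period, and letter case before comparing."""
    return "".join(text.split()).rstrip(".").lower()


def binary_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the normalized answers match exactly, else 0.0.

    Assumption: exact string match after normalization stands in for a
    real answer checker.
    """
    return 1.0 if normalize_answer(predicted) == normalize_answer(reference) else 0.0


rewards = [
    binary_reward(" 42 ", "42"),
    binary_reward("41", "42"),
    binary_reward("X + 1", "x+1"),
]
```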

Benchmarks:

MATH500 (Standard Math Reasoning)
Minerva Math (Math Reasoning)
OlympiadBench (Competition Math)
AIME 2024 (Competition Math)
AMC 2023 (Competition Math)

Metrics:

Accuracy (pass@1)
Training Wall-clock Time
Statistical methodology: Not explicitly reported in the paper
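
For reference, pass@1 accuracy is the fraction of problems solved when one response is sampled per problem. A minimal sketch, with the hypothetical function name `pass_at_1` and a list of per-problem correctness flags as input:

```python
from typing import Sequence


def pass_at_1(correct_flags: Sequence[bool]) -> float:
    """Fraction of problems whose single sampled response is correct."""
    if not correct_flags:
        raise ValueError("need at least one problem")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


accuracy = pass_at_1([True, False, True, True])
```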

Key Results

BRIDGE demonstrates superior efficiency and performance on Qwen models compared to baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Qwen2.5-3B Training	Relative Training Time	1.0	0.56	-0.44
Qwen2.5-3B Aggregate	Performance Gain	0.0	0.13	+0.13
Qwen3-8B Training	Relative Training Time	1.0	0.86	-0.14
Qwen3-8B Aggregate	Performance Gain	0.0	0.10	+0.10

Experiment Figures

Comparison of training dynamics (Response Length and Reward) for Cold-Start vs. RL-zero.

Test accuracy evolution during training for SFT, RL, Cold-Start, and the Alternating Baseline.

Main Takeaways

SFT and RL have complementary strengths: SFT provides rapid initial learning, while RL provides better asymptotic performance.
Cold-start (Two-Stage) training is suboptimal because the model forgets SFT patterns during the early RL phase (U-shaped response length curve).
BRIDGE consistently achieves the highest average accuracy across all five math benchmarks compared to SFT, RL-zero, Cold-start, and Alternating methods.
The method proves effective across multiple model families (Qwen and Llama) and sizes (3B and 8B).

📚 Prerequisite Knowledge

Prerequisites

Bilevel Optimization
Reinforcement Learning (PPO/GRPO)
Supervised Fine-Tuning (SFT)
Low-Rank Adaptation (LoRA)

Key Terms

SFT: Supervised Fine-Tuning—training a model to imitate expert demonstrations (prompts + answers)

RLVR: Reinforcement Learning with Verifiable Rewards—RL where correctness is determined by a rule-based checker (e.g., math answers)

Bilevel Optimization: A mathematical problem where one problem (upper-level) contains another problem (lower-level) as a constraint
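
A toy example may help make the definition concrete. The sketch below is a generic two-scalar bilevel problem, not BRIDGE itself: the lower level maximizes g(theta, w) = -(theta - 3)^2 - (theta - w)^2, whose solution is theta*(w) = (3 + w) / 2, and the upper level chooses w to maximize F(theta*(w)) = -(theta*(w) - 2)^2. All function names and constants are illustrative assumptions.

```python
def lower_grad(theta: float, w: float) -> float:
    """d/dtheta of g(theta, w) = -(theta - 3)^2 - (theta - w)^2."""
    return -2.0 * (theta - 3.0) - 2.0 * (theta - w)


def upper_hypergrad(theta: float) -> float:
    """Chain rule for F(theta*(w)) = -(theta*(w) - 2)^2.

    theta*(w) = (3 + w) / 2, so d theta*/dw = 0.5. The current theta is
    used in place of theta*(w); the two coincide at convergence.
    """
    return -2.0 * (theta - 2.0) * 0.5


def solve(steps: int = 500, lr: float = 0.1) -> tuple:
    theta, w = 0.0, 0.0
    for _ in range(steps):
        theta += lr * lower_grad(theta, w)   # lower-level ascent step
        w += lr * upper_hypergrad(theta)     # upper-level ascent step
    return theta, w


theta_final, w_final = solve()
```

The upper-level variable never touches the upper objective directly; it only influences it by shifting where the lower-level problem settles (here, toward theta = 2 with w = 1).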

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small rank-decomposition matrices

Cooperative Gain: The performance advantage of joint SFT-RL training over RL training alone, explicitly maximized by the BRIDGE upper-level objective

Danskin's Theorem: A mathematical theorem used to compute gradients of functions defined by maximization problems, used here to differentiate through the RL step
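
In its simplest form, assuming a compact feasible set Θ, a function f that is continuous and continuously differentiable in w, and a unique maximizer θ*(w), Danskin's theorem can be stated as below. How exactly the paper applies it is not detailed in this summary; the symbols φ, f, and Θ are generic.

```latex
\phi(w) = \max_{\theta \in \Theta} f(\theta, w),
\qquad
\theta^{*}(w) = \arg\max_{\theta \in \Theta} f(\theta, w),
\qquad
\nabla \phi(w) = \nabla_{w} f(\theta, w)\,\Big|_{\theta = \theta^{*}(w)}
```

The practical consequence is that the gradient of the maximized value with respect to w can be computed by holding the maximizer fixed, without differentiating through the inner optimization.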

PPO: Proximal Policy Optimization—a standard RL algorithm used to update the policy

GRPO: Group Relative Policy Optimization—an RL algorithm used effectively in reasoning models like DeepSeek-R1
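
The "group relative" part of GRPO refers to how advantages are computed: several responses are sampled for the same prompt, and each response's reward is normalized by the mean and standard deviation of its group. A minimal sketch follows; the function name is hypothetical, and the use of the population standard deviation and the small `eps` term are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt, scored with a binary reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal for that step.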

Cold Start: The standard practice of training a model with SFT first to provide a good initialization before starting RL

Catastrophic Forgetting: The tendency of a neural network to completely and abruptly forget previously learned information upon learning new information

SFT-RL Alternating: A simple baseline introduced in the paper that switches between SFT and RL updates without the cooperative bilevel objective