Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

📝 Paper Summary

Reinforcement Learning for LLMs
Policy Optimization Algorithms

SFPO stabilizes reasoning training by splitting updates into a fast exploration trajectory on the same batch, a repositioning step to control drift, and a final slow correction.

Core Problem

On-policy RL algorithms like GRPO suffer from noisy gradients and instability in early training because one-shot updates underutilize batch data while naive reuse leads to off-policy drift.

Why it matters:

Noisy gradients from low-quality early rollouts cause training instability and inefficient exploration
Discarding batch data after a single update step is sample-inefficient, requiring excessive rollouts to converge
Simply applying multiple updates to the same batch (naive off-policy) introduces distribution mismatch that degrades performance

Concrete Example: In early training, if a batch yields weak reasoning chains with stochastic rewards, a single GRPO update might step in a high-variance direction. Naively taking multiple steps on this noisy batch moves the policy too far from the data-generating distribution (drift), damaging future convergence.
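The drift in this example can be illustrated with a toy softmax policy over three actions. This is only a sketch under stated assumptions: `batch_direction` is a hypothetical fixed update direction standing in for the gradient of one noisy batch, and the step size is arbitrary. It shows how KL divergence from the data-generating policy grows as more steps are taken on the same batch.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

logits_old = np.zeros(3)               # policy that generated the batch
pi_old = softmax(logits_old)

# Assumption: a fixed direction estimated once from a single batch.
batch_direction = np.array([1.0, -0.5, -0.5])
lr = 0.5

drift = []
logits = logits_old.copy()
for _ in range(4):                     # naive reuse: step repeatedly on the same batch
    logits = logits + lr * batch_direction
    drift.append(kl(pi_old, softmax(logits)))

# Importance ratio pi_new(a) / pi_old(a) after the last step.
ratio = softmax(logits) / pi_old
```

Each additional step increases `drift`, and the importance ratio for the favored action moves away from 1, which is the distribution mismatch the example describes.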

Key Novelty

Fast-Reposition-Slow Update Mechanism

Decomposes each iteration into a 'fast trajectory' of multiple inner updates to stabilize direction, followed by a 'reposition' step that interpolates back toward the original policy to curb drift
Uses an entropy-based schedule to dynamically disable the repositioning mechanism near convergence, reverting to standard on-policy updates when noise dominates signal
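A minimal sketch of what an entropy-gated schedule could look like follows. The summary does not give the actual trigger rule, so the threshold test, the names `scheduled_alpha` and `entropy_floor`, and the numeric values are all assumptions for illustration. Setting the interpolation weight to 0 makes the reposition step return the original policy, so the iteration falls back to an ordinary single update.

```python
import numpy as np

def mean_token_entropy(probs):
    """Average Shannon entropy (in nats) over rows of a (tokens, vocab) array."""
    probs = np.asarray(probs, dtype=float)
    safe = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return float(-(probs * np.log(safe)).sum(axis=-1).mean())

def scheduled_alpha(entropy, base_alpha=0.5, entropy_floor=0.3):
    """Hypothetical rule: keep the interpolation weight while entropy is high,
    and drop it to 0 once entropy falls below the floor."""
    return base_alpha if entropy >= entropy_floor else 0.0

early = mean_token_entropy([[0.25, 0.25, 0.25, 0.25]])  # uniform, high entropy
late = mean_token_entropy([[0.97, 0.01, 0.01, 0.01]])   # peaked, low entropy

alpha_early = scheduled_alpha(early)
alpha_late = scheduled_alpha(late)
```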

Architecture

Conceptual diagram of the Slow-Fast Policy Optimization update trajectory in parameter space.

Evaluation Highlights

+2.80 average accuracy improvement on math benchmarks for DeepSeek-R1-Distill-Qwen-1.5B compared to GRPO
Up to +7.5 absolute accuracy gain on the challenging AIME25 benchmark with DeepSeek-R1-Distill-Qwen-1.5B
Reduces wall-clock training time by up to 4.19x and requires 4.93x fewer rollouts to match GRPO's best accuracy

Breakthrough Assessment

7/10

Significant efficiency and stability gains over GRPO, a standard industry baseline. The method is a plug-and-play optimization improvement rather than a fundamental architectural shift.

⚙️ Technical Details

Problem Definition

Setting: Policy gradient reinforcement learning for reasoning tasks

Inputs: Prompt q

Outputs: Generated reasoning chain and answer o

Pipeline Flow

Input Prompt
LLM Generator

System Modules

LLM Generator

Generate reasoning steps and final answer

Model or implementation: Qwen2.5-Math or DeepSeek-R1-Distill-Qwen variants

Modeling

Base Model: Qwen2.5-Math-1.5B/7B, DeepSeek-R1-Distill-Qwen-1.5B/7B, Qwen3-4B-Base

Training Method: Slow-Fast Policy Optimization (SFPO)

Objective Functions:

Purpose: Optimize policy to maximize expected reward.
Formally: Standard GRPO objective with KL regularization and importance sampling clipping.

Purpose: Stabilize update direction.
Formally: Fast trajectory of K inner steps: θ_{k+1} = θ_k - η ∇L(θ_k), for k = 0, ..., K-1, with θ_0 = θ_{start} and θ_K = θ_{end}.

Purpose: Control off-policy drift.
Formally: Reposition step: θ_{repro} = θ_{start} + α(θ_{end} - θ_{start}).

Purpose: Final update.
Formally: Slow correction: θ_{new} = θ_{repro} - η ∇L(θ_{repro}).
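The three update rules above can be sketched on a toy problem. This is an illustrative NumPy version and not the authors' implementation. The quadratic loss, the function name `sfpo_step`, and the values of η, K, and α are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

def sfpo_step(theta_start, grad_fn, lr=0.1, num_fast_steps=3, alpha=0.5):
    """One fast-reposition-slow iteration on a fixed batch (toy sketch)."""
    # Fast trajectory: K inner gradient steps on the same batch.
    theta = theta_start.copy()
    for _ in range(num_fast_steps):
        theta = theta - lr * grad_fn(theta)
    theta_end = theta

    # Reposition: interpolate between the start and the end of the trajectory.
    theta_repro = theta_start + alpha * (theta_end - theta_start)

    # Slow correction: one more gradient step from the repositioned point.
    return theta_repro - lr * grad_fn(theta_repro)

# Hypothetical stand-in loss L(theta) = 0.5 * ||theta||^2, so the gradient is theta.
def toy_grad(theta):
    return theta

theta0 = np.array([1.0, -2.0])
theta_sfpo = sfpo_step(theta0, toy_grad, lr=0.1, num_fast_steps=3, alpha=0.5)
theta_single = sfpo_step(theta0, toy_grad, lr=0.1, num_fast_steps=3, alpha=0.0)
```

With α = 0 the repositioned point equals θ_{start}, so `theta_single` is exactly one plain gradient step (0.9 × `theta0` here). With α = 0.5 the result is 0.77805 × `theta0`, a larger but still bounded move.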

Training Data:

DAPO training dataset + Math training dataset (approx. 24K examples)
Skywork-OR1 Math RL training dataset (105K examples)

Key Hyperparameters:

batch_size: 256
responses_per_question: 8
total_training_steps: 400
context_length: 4096 (Qwen2.5-Math) or 8192 (others)
rollout_temperature: 1
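As a rough sense of scale, the listed values imply the rollout budget computed below. The sketch assumes that `batch_size` counts prompts per training step and that every step samples a fresh batch. The summary does not state either, so treat the totals as back-of-the-envelope figures. The `RunConfig` class is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    batch_size: int = 256             # assumed: prompts per training step
    responses_per_question: int = 8   # rollouts sampled per prompt
    total_training_steps: int = 400
    rollout_temperature: float = 1.0

    @property
    def rollouts_per_step(self) -> int:
        return self.batch_size * self.responses_per_question

    @property
    def total_rollouts(self) -> int:
        return self.rollouts_per_step * self.total_training_steps

cfg = RunConfig()
print(cfg.rollouts_per_step, cfg.total_rollouts)  # prints: 2048 819200
```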

Compute: Single 8-GPU node

Comparison to Prior Work

vs. GRPO: SFPO uses multiple inner steps per batch (data reuse) with a reposition mechanism, whereas GRPO uses a single step.
vs. Naive Data Reuse: SFPO interpolates back to the on-policy point to prevent drift, whereas naive reuse diverges.

Limitations

Relies on a heuristic entropy-based schedule to switch off the mechanism near convergence
Increases computational cost per iteration due to the K inner updates (though it reduces the total number of iterations needed)
Performance gain depends on the quality of the 'fast trajectory' direction

Reproducibility

Code: https://slow-fast-po.github.io/

Project website available. Method is described as plug-compatible with existing pipelines (e.g., verl). Hyperparameters and datasets are specified.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with RL fine-tuning

Benchmarks:

Math500 (Math problem solving)
AIME24 (Competition math)
AIME25 (Competition math)
AMC (Competition math)
MinervaMath (Math problem solving)
Olympiad Bench (Olympiad-level math)

Metrics:

Pass@1 Accuracy
Statistical methodology: Average accuracy reported over benchmarks
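A small sketch of how these two metrics could be computed is given below. Exact string matching and the unweighted average across benchmarks are assumptions, since the summary does not say how answers are graded or whether the average is weighted by benchmark size. The toy answers are made up and are not results from the paper.

```python
def pass_at_1(first_answers, references):
    """Percentage of problems whose first generated answer matches the reference."""
    if len(first_answers) != len(references):
        raise ValueError("need exactly one answer per problem")
    correct = sum(a == r for a, r in zip(first_answers, references))
    return 100.0 * correct / len(references)

# Hypothetical toy data for two made-up benchmarks.
per_benchmark = {
    "bench_a": pass_at_1(["4", "9", "x"], ["4", "9", "7"]),
    "bench_b": pass_at_1(["1", "2"], ["1", "3"]),
}

# Assumed: unweighted mean over benchmarks, regardless of benchmark size.
macro_average = sum(per_benchmark.values()) / len(per_benchmark)
```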

Key Results

SFPO consistently improves average accuracy across various model sizes and families compared to the GRPO baseline.
Benchmark	Metric	Baseline (GRPO)	This Paper (SFPO)	Δ
Average (6 benchmarks)	Pass@1	47.73	50.53	+2.80
Average (6 benchmarks)	Pass@1	38.35	40.19	+1.84
Average (6 benchmarks)	Pass@1	60.47	63.04	+2.57
Average (6 benchmarks)	Pass@1	43.99	45.59	+1.60
SFPO shows outsized improvements on difficult competition benchmarks (AIME25 gain reported for DeepSeek-R1-Distill-Qwen-1.5B).
Benchmark	Metric	Baseline (GRPO)	This Paper (SFPO)	Δ
AIME25	Pass@1	not reported	not reported	+7.5
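The Δ column for the four average rows can be checked directly from the reported scores. The snippet below only recomputes differences from the numbers in the table. It does not attribute rows to particular models, since the table does not label them.

```python
# (baseline, this paper, reported delta) for the four "Average (6 benchmarks)" rows.
rows = [
    (47.73, 50.53, 2.80),
    (38.35, 40.19, 1.84),
    (60.47, 63.04, 2.57),
    (43.99, 45.59, 1.60),
]

recomputed = [round(sfpo - grpo, 2) for grpo, sfpo, _ in rows]
consistent = all(abs(r - reported) < 1e-9
                 for r, (_, _, reported) in zip(recomputed, rows))
```

All four recomputed differences agree with the reported Δ values, so `consistent` is `True`.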

Experiment Figures

Training curves comparing SFPO and GRPO accuracy over training steps.

Main Takeaways

SFPO consistently outperforms GRPO across model scales (1.5B to 7B) and types (Math-specialized vs General Base).
Efficiency is substantially improved, requiring up to 4.93x fewer rollouts to match GRPO's best accuracy, addressing the high cost of RL exploration.
The method is robust to dataset scale, showing gains on both the smaller DAPO+Math (24K) and larger Skywork-OR1 (105K) datasets.
Training dynamics show faster initial convergence and higher final stability compared to GRPO.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Trust Region methods
Large Language Models

Key Terms

SFPO: Slow-Fast Policy Optimization—the proposed update rule combining inner updates, repositioning, and slow correction

GRPO: Group Relative Policy Optimization—a policy gradient method that normalizes rewards within a group of outputs for the same prompt, eliminating the need for a value function critic

Fast Trajectory: A sequence of multiple inner gradient updates performed on the same batch of data to stabilize the search direction

Reposition: An interpolation step that pulls the parameters back towards the initial on-policy point to control the distribution mismatch caused by inner updates

Slow Correction: A final gradient step applied after repositioning to align with local curvature

Pass@1: The percentage of problems where the model's first generated answer is correct

Rollout: The process of generating a complete sequence (reasoning chain + answer) from the policy given a prompt

On-policy: Learning from data generated by the current version of the policy (as opposed to old or historical data)

Off-policy drift: The discrepancy between the data distribution the model is learning from and the model's current policy distribution, which occurs when reusing data for multiple updates
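The group normalization in the GRPO entry above can be sketched as follows. This uses the commonly described mean and standard-deviation normalization within each prompt's group. The `eps` term and the function name are assumptions, and the rewards are made-up 0/1 correctness scores.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within each prompt's group of sampled responses.

    rewards: array of shape (num_prompts, group_size).
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)   # eps guards against zero variance

# Hypothetical rewards: 2 prompts, 8 responses each.
rewards = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0],   # mixed outcomes give a useful signal
    [0, 0, 0, 0, 0, 0, 0, 0],   # all wrong, so every advantage is zero
])
adv = group_relative_advantages(rewards)
```

Because each response is scored relative to its own group, no separate value network is needed. A group with identical rewards contributes no gradient signal.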