BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

📝 Paper Summary

Off-policy Reinforcement Learning, LLM Alignment, Mathematical Reasoning

BAPO stabilizes off-policy LLM training by dynamically adjusting clipping bounds to balance positive and negative sample contributions, preventing gradient domination by negative samples and preserving exploration entropy.

Core Problem

Off-policy RL (using stale data) for LLMs suffers from sharp entropy decline and unstable optimization because standard clipping mechanisms block entropy-increasing updates while allowing negative samples to dominate gradients.

Why it matters:

Off-policy training improves sample efficiency and enables advanced infrastructure like partial rollouts, but current instability (gradient explosions, collapse) prevents its widespread use.
Standard PPO/GRPO clipping unintentionally suppresses exploration by filtering out low-probability positive tokens, driving the policy toward premature over-exploitation.
As data staleness increases in large-scale training (e.g., experience replay), performance degrades rapidly with existing methods.

Concrete Example: In a reasoning task, a model might generate a correct but low-probability intermediate step. Standard PPO clips this positive update because the probability ratio is outside [0.8, 1.2], blocking the 'surprise' signal needed to increase entropy. Meanwhile, incorrect long chain-of-thought traces generate massive negative gradients that aren't effectively clipped, overwhelming the optimizer.
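The asymmetry in this example follows directly from the standard PPO clipped surrogate. The sketch below assumes the usual form min(r * A, clip(r, c_low, c_high) * A) with bounds [0.8, 1.2]; the function name `ppo_token_gradient_weight` and the sample ratios are illustrative and do not come from the paper.

```python
def ppo_token_gradient_weight(ratio, advantage, c_low=0.8, c_high=1.2):
    """Derivative of the clipped surrogate with respect to the ratio, for one token.

    surrogate = min(ratio * A, clip(ratio, c_low, c_high) * A)
    The derivative is A when the unclipped term is selected, and 0 when the
    clipped (constant) term is selected.
    """
    clipped = min(max(ratio, c_low), c_high)
    unclipped_term = ratio * advantage
    clipped_term = clipped * advantage
    # Gradient flows only if the min picks the unclipped term.
    if unclipped_term <= clipped_term:
        return advantage
    return 0.0


# Positive token whose ratio has risen to 1.5: the update is blocked.
print(ppo_token_gradient_weight(1.5, advantage=+1.0))  # 0.0
# Negative token with the same ratio: the full penalty still passes through.
print(ppo_token_gradient_weight(1.5, advantage=-1.0))  # -1.0
```

With a positive advantage, only the upper bound ever blocks an update; with a negative advantage, only the lower bound does. This is why the two bounds can be tuned separately for positive and negative tokens.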

Key Novelty

Balanced Policy Optimization with Adaptive Clipping (BAPO)

Dynamically adjusts the upper and lower clipping bounds ($c_{high}$, $c_{low}$) per batch to ensure positive samples contribute a target proportion ($\rho_0$) to the total loss.
Asymmetrically expands the clipping range for positive tokens to include valid but low-probability actions (increasing entropy) while strictly filtering excessive negative tokens to prevent gradient explosion.

Architecture

Conceptual comparison between GRPO's symmetric clipping and BAPO's adaptive asymmetric clipping.

Evaluation Highlights

BP-Math-32B BAPO achieves 87.1% on AIME 2024, outperforming proprietary o3-mini-medium (79.6%) and open-source SkyWork-OR1-32B (82.2%).
BP-Math-7B BAPO reaches 70.8% on AIME 2024, surpassing SkyWork-OR1-7B (70.2%) and achieving results comparable to Gemini-2.0 Flash-Thinking.
Maintains training stability even with 8x data staleness, whereas baseline GRPO suffers performance collapse under the same conditions.

Breakthrough Assessment

8/10

BAPO substantially mitigates the well-known instability of off-policy RL for LLMs, enabling efficient training pipelines (such as partial rollouts) while reporting AIME results that match or exceed several proprietary models.

⚙️ Technical Details

Problem Definition

Setting: Off-policy Reinforcement Learning for LLMs where the training policy $\pi_\theta$ differs from the rollout/behavior policy $\pi_{\theta_{old}}$.

Inputs: Prompt $x$, Response $y$, Rewards $R(x, y)$, Stale trajectories from replay buffer

Outputs: Optimized Policy $\pi_\theta$

Pipeline Flow

Generate responses (Rollout) using behavior policy
Compute Rewards and Advantages (GRPO)
Calculate Importance Sampling weights ($r_t$)
Adaptive Clipping: Adjust $c_{high}/c_{low}$ until positive tokens contribute $\rho_0$ to loss
Update Policy $\pi_\theta$ via Gradient Ascent
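The adaptive clipping step in the pipeline above can be sketched as a small loop. This is a minimal sketch, not the paper's Algorithm 1. Several things are assumptions made for illustration: the function names, the way the positive contribution is measured (absolute surrogate mass of tokens that survive clipping), the choice to start from the tightest bounds, and the mapping of delta1 to the upper bound and delta2 to the lower bound. The default values are the hyperparameters listed in this summary.

```python
def positive_contribution(ratios, advantages, c_low, c_high):
    """Share of surviving surrogate mass that comes from positive-advantage tokens.

    Assumption: a positive token survives if ratio <= c_high, a negative token
    survives if ratio >= c_low (the cases where the clipped objective still
    has a gradient).
    """
    pos = 0.0
    neg = 0.0
    for r, a in zip(ratios, advantages):
        if a > 0 and r <= c_high:
            pos += abs(r * a)
        elif a < 0 and r >= c_low:
            neg += abs(r * a)
    total = pos + neg
    return pos / total if total > 0 else 0.0


def adapt_clipping_bounds(ratios, advantages, rho0=0.4,
                          low_range=(0.6, 0.9), high_range=(1.2, 3.0),
                          delta_high=0.05, delta_low=0.02):
    """Widen c_high and raise c_low until positive tokens reach the rho0 share."""
    c_low, c_high = low_range[0], high_range[0]
    tol = 1e-9  # guards against floating-point drift at the range limits
    while positive_contribution(ratios, advantages, c_low, c_high) < rho0:
        moved = False
        if c_high + delta_high <= high_range[1] + tol:
            c_high += delta_high  # admit more low-probability positive tokens
            moved = True
        if c_low + delta_low <= low_range[1] + tol:
            c_low += delta_low    # filter out more negative tokens
            moved = True
        if not moved:
            break  # both bounds are at their limits; stop even if below rho0
    return c_low, c_high


# Hypothetical batch: one positive token is clipped at 1.2, negatives dominate.
ratios = [1.0, 1.5, 0.7, 0.7, 1.0]
advantages = [1.0, 1.0, -1.0, -1.0, -1.0]
c_low, c_high = adapt_clipping_bounds(ratios, advantages)
print(c_low, c_high, positive_contribution(ratios, advantages, c_low, c_high))
```

The returned bounds would then be used in the clipped objective for that batch. Because the loop restarts from the tightest bounds each time, a batch that is already balanced keeps the GRPO-like defaults.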

System Modules

LLM Policy

Generates reasoning traces and answers based on prompts

Model or implementation: DeepSeek-R1-Distill-Qwen (7B/32B) or Llama-3.2-3B

Adaptive Clipper

Dynamically calculates clipping bounds for the loss function

Model or implementation: Algorithm 1 (BAPO)

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-7B and 32B; OctoThinker-Llama3.2-3B-Long-Zero

Training Method: BAPO (Balanced Policy Optimization with Adaptive Clipping)

Objective Functions:

Purpose: Optimize policy while balancing exploration and stability.

Formally: maximize $J_{BAPO}(\theta) = \mathbb{E}[\min(r_t A_t, \text{clip}(r_t, c_{low}, c_{high}) A_t)]$, where bounds are dynamic.
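For reference, the symbols in this objective can be written out as follows. The ratio and clip definitions are the standard PPO ones. The per-token term $\ell_t$ is notation introduced here for convenience, not taken from the paper, and the case analysis is a direct consequence of the min/clip form rather than a result quoted from the authors.

```latex
\begin{align*}
r_t(\theta) &= \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{old}}(y_t \mid x, y_{<t})} \\
\text{clip}(r_t, c_{low}, c_{high}) &= \min\big(\max(r_t, c_{low}),\, c_{high}\big) \\
\ell_t &= \min\big(r_t A_t,\ \text{clip}(r_t, c_{low}, c_{high})\, A_t\big) \\
\frac{\partial \ell_t}{\partial r_t} &=
\begin{cases}
0 & \text{if } A_t > 0 \text{ and } r_t > c_{high} \\
0 & \text{if } A_t < 0 \text{ and } r_t < c_{low} \\
A_t & \text{otherwise}
\end{cases}
\end{align*}
```

Read this way, $c_{high}$ only gates positive-advantage tokens and $c_{low}$ only gates negative-advantage tokens, so moving the two bounds independently changes the positive and negative contributions independently.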

Adaptation: Full fine-tuning (assumed based on standard RL practices for these sizes)

Training Data:

SkyWork-OR1-RL-Data (math reasoning prompts)

Key Hyperparameters:

positive_token_contribution_rho0: 0.4
clipping_bound_movable_range: [0.6, 0.9] (low), [1.2, 3.0] (high)
bound_step_size: delta1=0.05, delta2=0.02
learning_rate: 2e-6
max_response_length: 8k (preliminary) / 64k (main)
temperature: 0.6
rollouts_per_prompt: 16

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. GRPO: GRPO uses fixed symmetric clipping ([0.8, 1.2]), causing entropy collapse in off-policy settings. BAPO uses dynamic asymmetric clipping.
vs. Clip-Higher: Clip-Higher uses a static, raised upper bound. BAPO adapts both bounds per batch based on loss contribution, so the positive share of the loss stays near the target.
vs. DAPO [not cited in paper]: DAPO uses widely relaxed constraints, whereas BAPO specifically targets the ratio of positive/negative loss contributions.

Limitations

Hyperparameters ($\rho_0$, step sizes) must still be chosen, though the authors report that the defaults are robust.
Theoretical analysis relies on approximations of the covariance between log-probs and advantages.
Experiments focused primarily on mathematical reasoning tasks (AIME/MATH).

Reproducibility

Code: https://github.com/WooooDyy/BAPO

The code and the SkyWork-OR1-RL-Data dataset are public, and hyperparameters are provided. Compute resources (GPU hours) are not specified.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using chain-of-thought generation.

Benchmarks:

AIME 2024 (Competition Math Reasoning)
AIME 2025 (Competition Math Reasoning)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
32B scale: BAPO outperforms SFT, GRPO, and comparable open-source and proprietary models.
AIME 2024	Accuracy (%)	84.6	87.1	+2.5
AIME 2025	Accuracy (%)	78.8	80.0	+1.2
7B scale: consistent gains.
AIME 2024	Accuracy (%)	69.2	70.8	+1.6
AIME 2025	Accuracy (%)	59.2	62.5	+3.3
Staleness robustness check: BAPO compared against GRPO and Clip-Higher.
SkyWork-OR1-RL (Training Set)	Reward	0.20	0.62	+0.42

Main Takeaways

BAPO significantly outperforms GRPO and SFT baselines across 7B and 32B scales on challenging math benchmarks.
The method exhibits strong robustness to data staleness, maintaining high rewards even when data is 8x stale, unlike GRPO which collapses.
BAPO effectively maintains policy entropy during training, preventing the over-exploitation/collapse observed in standard off-policy RL.
The 32B BAPO model competes with or beats state-of-the-art proprietary models like o3-mini-medium and Gemini-2.5-Flash-Thinking.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Proximal Policy Optimization (PPO)
Importance Sampling
Policy Entropy

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average of multiple samples for the same prompt.

Off-policy RL: Training a reinforcement learning model using data generated by a previous version of the policy (stale data), rather than the current policy.

Importance Sampling: A technique to estimate properties of a target distribution using samples from a different distribution (behavior policy) by weighting samples based on the likelihood ratio.

Policy Entropy: A measure of the randomness or exploration capability of the policy; high entropy means the model explores diverse outputs, while low entropy indicates over-exploitation.

Data Staleness: The degree to which the data used for training lags behind the current policy parameters; higher staleness means the data comes from significantly older versions of the model.

PPO: Proximal Policy Optimization—a standard RL algorithm that uses clipping to prevent the new policy from deviating too far from the old policy.

Partial Rollout: An infrastructure optimization where long sequences are generated in segments; unfinished segments are stored and resumed later, creating off-policy data.

SFT: Supervised Fine-Tuning—training the model on high-quality labeled demonstrations before applying RL.
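Three of the terms above reduce to short formulas, sketched here with the Python standard library only. The function names and sample numbers are illustrative. Dividing by the group standard deviation is the common GRPO formulation and is an assumption here, since the definition above mentions only the group average.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward minus group mean, scaled by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def importance_ratio(logp_current, logp_behavior):
    """Likelihood ratio pi_theta / pi_theta_old for one token, from log-probs."""
    return math.exp(logp_current - logp_behavior)


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Four rollouts for one prompt, two correct (reward 1) and two incorrect (reward 0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# A token the current policy assigns 0.3 that the behavior policy assigned 0.2.
print(importance_ratio(math.log(0.3), math.log(0.2)))
# A uniform distribution over four tokens has the maximum entropy, log(4).
print(entropy([0.25, 0.25, 0.25, 0.25]))
```

In the second example the ratio is about 1.5, which lies above a fixed upper bound of 1.2. If that token has a positive advantage, this is the kind of update that fixed symmetric clipping would block.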