On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

📝 Paper Summary

Reinforcement Learning for LLMs Mathematical Reasoning

RPG unifies KL-regularized policy gradient methods by deriving exact off-policy gradients for unnormalized KL divergences and stabilizing training via a clipped importance-weighted REINFORCE estimator.

Core Problem

Existing KL-regularized methods like GRPO often use ad-hoc estimators (e.g., k3) that lack correct importance weighting for off-policy sampling, leading to gradients that do not match the intended objective.

Why it matters:

Current methods suffer from high variance or mathematical inconsistencies when training on data sampled from older policies (off-policy setting)
Stability is critical for scaling RL to long-context reasoning tasks where exact on-policy sampling is computationally expensive
Misaligned gradients in methods like GRPO can cause instability or suboptimal convergence in mathematical reasoning tasks

Concrete Example: In GRPO, the KL penalty is estimated using the k3 estimator on samples from an old policy without an importance weight. This means the optimization direction is mathematically mismatched to the true KL-regularized objective, potentially leading to destructive updates.
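
The mismatch can be seen on a toy example. The sketch below is illustrative only: the three categorical distributions are made-up values, and a separate reference policy `pi_ref` is assumed (as in standard GRPO) rather than taken from this summary. It compares the k3 penalty averaged under the current policy, under the old policy without a weight, and under the old policy with the importance weight π_θ/π_old. It checks values rather than gradients, which is the simpler half of the argument.

```python
import math

def k3(ratio):
    """k3 term: r - log r - 1 (non-negative, zero at r = 1)."""
    return ratio - math.log(ratio) - 1.0

# Toy categorical distributions over three outcomes (illustrative values only).
pi_old = [0.7, 0.2, 0.1]    # behavior policy that generated the samples
pi_theta = [0.2, 0.3, 0.5]  # current policy being optimized
pi_ref = [0.4, 0.4, 0.2]    # assumed reference policy in the KL penalty

# Exact KL(pi_theta || pi_ref), the quantity the penalty is meant to track.
true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# k3 evaluated at r = pi_ref / pi_theta for each outcome.
penalty = [k3(q / p) for p, q in zip(pi_theta, pi_ref)]

# Exact expectations (sums over the support, so no sampling noise).
on_policy = sum(p * k for p, k in zip(pi_theta, penalty))
unweighted = sum(b * k for b, k in zip(pi_old, penalty))
weighted = sum(b * (p / b) * k for b, p, k in zip(pi_old, pi_theta, penalty))

print(f"true KL            : {true_kl:.4f}")
print(f"k3 under pi_theta  : {on_policy:.4f}")
print(f"k3 under pi_old    : {unweighted:.4f}")
print(f"w * k3 under pi_old: {weighted:.4f}")
```

With these numbers the on-policy and importance-weighted averages both recover KL(π_θ ‖ π_ref), while the unweighted average over old-policy samples does not.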

Key Novelty

Regularized Policy Gradient (RPG) Framework

Unifies normalized and unnormalized KL variants (Forward/Reverse) under a single derivation, proving the popular k3 estimator is exactly the unnormalized Reverse KL
Identifies and corrects the missing importance weight in GRPO's KL term, deriving a surrogate loss that yields the exact gradient of the intended objective
Introduces RPG-Style Clip, a dual-clipped REINFORCE estimator that stabilizes off-policy updates by bounding importance ratios based on the sign of the regularized advantage

Architecture

Overview of the iterative Regularized Policy Gradient (RPG) framework and its core engine.

Evaluation Highlights

Achieves 52.08% accuracy on AIME25 with Qwen3-4B (8K context), surpassing the official Qwen3-4B-Instruct model (47%)
RPG-REINFORCE outperforms the strong DAPO baseline by +4.68 percentage points on AIME25 and +2.18 points on AIME24
Demonstrates more stable reward and entropy curves than GRPO, whose higher volatility is likely due to its incorrect KL weighting

Breakthrough Assessment

9/10

Provides a rigorous theoretical unification of scattered KL-regularized methods and corrects a fundamental weighting error in the widely used GRPO, while delivering state-of-the-art reasoning performance.

⚙️ Technical Details

Problem Definition

Setting: Optimization of expected reward with KL regularization under off-policy sampling

Inputs: Prompt x, Reference Policy π_old, Current Policy π_θ

Outputs: Updated Policy Parameters θ

Pipeline Flow

Prompt Sampling
Generation (Policy Rollout)
Reward Calculation
RPG Update (Gradient Estimation)
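
The four stages above can be sketched on a toy problem. This is a minimal stand-in under strong assumptions, not the paper's implementation: the "policy" is a softmax over three candidate answers, the reward table is a hypothetical verifier, there is no KL term, and `run_toy_pipeline` and its parameters are names invented for illustration. It only shows where the frozen snapshot π_old and the importance weight π_θ/π_old enter the loop.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_toy_pipeline(steps=200, batch=64, inner_updates=2, lr=0.5, seed=0):
    """Toy stand-in for the four stages; no LLM and no KL regularization."""
    rng = random.Random(seed)
    rewards = [0.0, 0.0, 1.0]  # hypothetical verifier: only answer 2 is correct
    theta = [0.0, 0.0, 0.0]    # logits over three candidate answers
    for _ in range(steps):
        # Stages 1-2: one fixed "prompt"; roll out from a frozen snapshot pi_old.
        pi_old = softmax(theta)
        answers = rng.choices(range(3), weights=pi_old, k=batch)
        # Stage 3: score each sampled answer.
        scored = [(a, rewards[a]) for a in answers]
        # Stage 4: several steps on the same batch, so later steps are off-policy.
        for _ in range(inner_updates):
            pi = softmax(theta)
            grad = [0.0, 0.0, 0.0]
            for a, r in scored:
                w = pi[a] / pi_old[a]  # importance weight pi_theta / pi_old
                for j in range(3):
                    indicator = 1.0 if j == a else 0.0
                    # d/d theta_j of (w * r) equals w * r * (indicator - pi[j])
                    grad[j] += w * r * (indicator - pi[j]) / batch
            theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)

final_policy = run_toy_pipeline()
print([round(p, 3) for p in final_policy])
```

Reusing each batch for more than one update is what makes the importance weight differ from 1; with a single update per rollout the loop reduces to on-policy REINFORCE.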

System Modules

Policy Model

Generates reasoning chains and answers given a prompt

Model or implementation: Qwen3-4B / Qwen2.5-7B-Instruct

Modeling

Base Model: Qwen3-4B and Qwen2.5-7B-Instruct

Training Method: RPG-REINFORCE (Regularized Policy Gradient with REINFORCE estimator)

Objective Functions:

Purpose: Optimize reward while penalizing deviation from reference, using Unnormalized Forward KL (UFKL).

Formally: L(θ) = E[-w(x)R(x) + β(w(x) - log w(x) - 1)] where w(x) = π_θ(x)/π_old(x).

Purpose: Optimize reward while penalizing deviation from reference, using Unnormalized Reverse KL (URKL).

Formally: L(θ) = E[-w(x)R(x) + β(w(x) log w(x) - w(x))].

Purpose: Stabilize REINFORCE updates off-policy via clipping.

Formally: L(θ) = -E[C(w(x), SG(A_reg)) log π_θ(x)] where C is the Dual-Clip operator and A_reg is the regularized advantage.
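
The UFKL and URKL objectives listed above can be written as per-sample functions of the importance weight. The following is a minimal sketch that transcribes the two formulas as given in this summary; the function names, the use of sequence-level log-probabilities, and the sample values are assumptions made for illustration, and a real implementation would use an autodiff framework rather than plain floats.

```python
import math

def ufkl_loss(w, reward, beta):
    """Per-sample UFKL objective: -w*R + beta*(w - log w - 1)."""
    return -w * reward + beta * (w - math.log(w) - 1.0)

def urkl_loss(w, reward, beta):
    """Per-sample URKL objective: -w*R + beta*(w*log w - w)."""
    return -w * reward + beta * (w * math.log(w) - w)

def batch_loss(per_sample_loss, logp_theta, logp_old, rewards, beta):
    """Average a per-sample loss over sequences sampled from pi_old."""
    total = 0.0
    for lp, lq, r in zip(logp_theta, logp_old, rewards):
        w = math.exp(lp - lq)  # w = pi_theta / pi_old from log-probabilities
        total += per_sample_loss(w, r, beta)
    return total / len(rewards)

# Hypothetical sequence log-probabilities and binary rewards for three samples.
logp_old = [-12.0, -9.5, -11.0]
logp_theta = [-11.8, -9.5, -11.3]
rewards = [1.0, 0.0, 1.0]

ufkl = batch_loss(ufkl_loss, logp_theta, logp_old, rewards, beta=0.1)
urkl = batch_loss(urkl_loss, logp_theta, logp_old, rewards, beta=0.1)
print(f"UFKL objective: {ufkl:.4f}")
print(f"URKL objective: {urkl:.4f}")
```

Both penalty terms have zero derivative with respect to w at w = 1, so neither pushes the policy away from π_old when the reward is zero. The URKL form as written equals -β at w = 1; it differs from a non-negative penalty only by a constant, which does not affect gradients.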

Adaptation: Full fine-tuning

Trainable Parameters: All parameters of the LLM

Training Data:

DAPO-Math-17k dataset (13.9k English samples)
Evaluated on AIME24, AIME25, AMC23

Key Hyperparameters:

clip_epsilon_1: 0.2 (RPG/DAPO)
clip_epsilon_2: 0.28 (RPG/DAPO)
learning_rate: Not reported in the main text (the paper refers to Appendix H)
kl_beta: Not explicitly reported in main text

Compute: Uses vLLM engine for inference; implementation avoids keeping full reference model in memory by pre-computing probabilities

Comparison to Prior Work

vs. GRPO: RPG adds correct importance weights to the KL term for valid off-policy gradients; GRPO omits them.
vs. DAPO: RPG achieves higher accuracy and stability through mathematically grounded KL regularization and dual-clipping.
vs. PPO: RPG-REINFORCE simplifies the PPO-Clip objective by integrating the KL term directly into the advantage/weight and using stop-gradients, often avoiding a separate value network for the KL part.

Limitations

RPG-Style Clip introduces a bias-variance trade-off controlled by epsilon hyperparameters that requires tuning
Iterative reference updates are needed to balance stability and plasticity, adding a schedule to manage
Experiments focus on math reasoning; generalization to other domains (coding, creative writing) is less explored

Reproducibility

Code: https://github.com/complex-reasoning/RPG

The paper provides detailed derivations in its appendices. Reference model update frequency and specific beta schedules are discussed in the implementation details.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning with Chain-of-Thought

Benchmarks:

AIME24 (Math Competition Problems)
AIME25 (Math Competition Problems)
AMC23 (Math Competition Problems)

Metrics:

Accuracy (Pass@1)
Mean@32 (Average accuracy of 32 sampled responses)

Statistical methodology: Not explicitly reported in the paper
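
As a rough illustration of the Mean@32 metric, the sketch below averages per-problem accuracy over sampled responses. It assumes the definition given in the metric list above; the grading data is hypothetical and uses 4 samples per problem instead of 32 to stay short.

```python
def mean_at_k(correct_by_problem):
    """Average, over problems, of the fraction of sampled responses that are correct."""
    per_problem = [sum(flags) / len(flags) for flags in correct_by_problem]
    return sum(per_problem) / len(per_problem)

# Hypothetical grading results: 2 problems, 4 sampled responses each.
graded = [
    [True, True, False, True],    # 3 of 4 correct
    [False, False, True, False],  # 1 of 4 correct
]
score = mean_at_k(graded)
print(score)  # (0.75 + 0.25) / 2 = 0.5
```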

Key Results

Performance on AIME24 and AIME25 benchmarks with 8K context length, showing RPG variants outperforming baselines.

Benchmark	Metric	Baseline	This Paper	Δ
AIME24	Accuracy (Best)	57.40	61.77	+4.37
AIME24	Accuracy (Best)	43.96	61.77	+17.81
AMC23	Accuracy (Best)	94.30	95.39	+1.09

Experiment Figures

Visualization of the RPG-Style Clip (Dual-Clip) loss term vs. importance weight w(x).

Training dynamics (Accuracy, Reward, Entropy, Response Length) on AIME24/25 with 8K context.

Main Takeaways

RPG-REINFORCE with RPG-Style Clip consistently outperforms GRPO and DAPO on challenging math benchmarks (AIME24/25), demonstrating the value of correct off-policy weighting.
The method scales effectively to 8K context lengths, achieving 52.08% on AIME25 and surpassing the official Qwen3-4B-Instruct model's 47%.
Training dynamics (reward, entropy) are significantly more stable with RPG compared to GRPO, which exhibits volatility likely due to its mismatched KL gradient estimator.

📚 Prerequisite Knowledge

Prerequisites

Policy Gradient Theorem
Importance Sampling
Kullback-Leibler (KL) Divergence
Proximal Policy Optimization (PPO)

Key Terms

RPG: Regularized Policy Gradient—a framework deriving exact gradients for KL-regularized objectives under off-policy sampling

GRPO: Group Relative Policy Optimization—a PPO variant that normalizes advantages within a group of outputs for the same prompt, typically without a value network
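
The group normalization described for GRPO can be sketched as follows. This is the commonly described mean/standard-deviation form; the use of the population standard deviation, the `eps` constant, and the function name are assumptions, and implementations differ on these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary rewards for four outputs sampled for the same prompt.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

When every output in a group receives the same reward, all advantages are zero and the group contributes no policy-gradient signal.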

k3 estimator: A specific estimator for KL divergence, y - log y - 1 for a probability ratio y, used in PPO and GRPO, which RPG proves is equivalent to Unnormalized KL

UKL: Unnormalized KL Divergence, a generalized KL formulation that remains well defined for non-negative measures whose total mass need not equal 1
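
For reference, the standard generalized-KL definition can be written out and specialized to the importance weight w(x) = π_θ(x)/π_old(x). This is a worked restatement under the assumption that π_old is positive wherever π_θ is; the notation UKL(p ‖ q) is chosen here for illustration and may differ from the paper's.

```latex
\begin{align*}
\mathrm{UKL}(p \,\|\, q)
  &= \int \Big( p(x)\log\frac{p(x)}{q(x)} - p(x) + q(x) \Big)\,dx, \\
\mathrm{UKL}(\pi_{\mathrm{old}} \,\|\, \pi_\theta)
  &= \mathbb{E}_{x\sim\pi_{\mathrm{old}}}\big[\, w(x) - \log w(x) - 1 \,\big], \\
\mathrm{UKL}(\pi_\theta \,\|\, \pi_{\mathrm{old}})
  &= \mathbb{E}_{x\sim\pi_{\mathrm{old}}}\big[\, w(x)\log w(x) - w(x) + 1 \,\big].
\end{align*}
```

When both arguments integrate to 1, the extra terms cancel and each expression reduces to the ordinary KL divergence. The second line has the same form as the k3 expression y - log y - 1 with y = w(x).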

Importance Sampling: A technique to estimate properties of a target distribution while sampling from a different (proposal) distribution by weighting samples by the ratio of their probabilities

REINFORCE: A fundamental policy gradient algorithm that updates policies based on the return of complete trajectories

DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization, a recent RL baseline for LLM reasoning that uses separate lower and upper clipping bounds

SFT: Supervised Fine-Tuning—training on labeled data before RL

Dual-Clip: A clipping strategy that bounds importance weights differently depending on whether the advantage is positive or negative
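
One plausible reading of such a sign-dependent clip is sketched below. The exact operator is not specified in this summary, so the rule here is an assumption: the weight is capped at 1 + ε₂ for non-negative advantages and kept within [1 - ε₁, c] for negative ones. The cap `dual_cap` (c = 3.0), the mapping of 0.2 and 0.28 to lower and upper bounds, and the function names are all hypothetical.

```python
import math

def dual_clip_weight(w, advantage, eps_low=0.2, eps_high=0.28, dual_cap=3.0):
    """Bound the importance weight w according to the sign of the advantage.

    Assumed rule (not taken from the paper): cap w at 1 + eps_high when the
    advantage is non-negative; otherwise keep w within [1 - eps_low, dual_cap].
    """
    if advantage >= 0:
        return min(w, 1.0 + eps_high)
    return min(max(w, 1.0 - eps_low), dual_cap)

def clipped_reinforce_loss(logp_theta, logp_old, advantages):
    """Mean of -coef * log pi_theta, where coef = clipped weight * advantage.

    In an autodiff framework coef would be wrapped in a stop-gradient so that
    only log pi_theta is differentiated; here it is just a plain number.
    """
    total = 0.0
    for lp, lq, adv in zip(logp_theta, logp_old, advantages):
        w = math.exp(lp - lq)  # importance weight pi_theta / pi_old
        coef = dual_clip_weight(w, adv) * adv
        total += -coef * lp
    return total / len(advantages)

print(dual_clip_weight(2.0, 1.0))    # large weight, positive advantage: capped above
print(dual_clip_weight(0.5, -1.0))   # small weight, negative advantage: floored
print(dual_clip_weight(10.0, -1.0))  # very large weight, negative advantage: capped
```

Bounding the weight on both sides for negative advantages keeps a single sample with a very large ratio from dominating the update, at the cost of some bias in the gradient estimate.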