
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Kuaishou Technology
arXiv.org (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Topics: Large Language Model Reasoning · Reinforcement Learning from Human Feedback (RLHF)
Klear-Reasoner improves reasoning performance at the 8B scale by replacing standard RL clipping with Gradient-Preserving Clipping Policy Optimization (GPPO), which keeps bounded gradients flowing for clipped high-entropy tokens, improving exploration and speeding convergence.
Core Problem
Standard clipping in reinforcement learning objectives such as PPO and GRPO indiscriminately discards gradients for tokens that fall outside the trust region, suppressing valuable exploration signals from high-entropy tokens and slowing learning from negative samples.
Why it matters:
  • High-entropy tokens often correspond to critical 'Aha!' moments or decision branches in reasoning chains; clipping them prevents the model from reinforcing these exploratory breakthroughs
  • When suboptimal trajectories are clipped (importance ratio < 1-epsilon), the model receives no gradient signal to correct them, forcing it to repeatedly sample bad outputs before learning
  • Reproducing high-performance reasoning models (like O1 or DeepSeek-R1) remains difficult due to these training instabilities and lack of public details
Concrete Example: In a math problem, if the model attempts a novel but correct step that drastically changes the policy distribution (high importance ratio), standard PPO clips the gradient to zero, effectively ignoring this 'lightbulb moment'. Similarly, if it generates a clearly wrong step with low probability, clipping prevents the model from receiving the 'don't do that' signal.
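The zeroed gradient can be read directly off the clipped surrogate. A minimal sketch in plain Python (not the paper's code) of a per-token PPO-clip objective and its gradient with respect to the token log-probability, illustrating both failure cases above:

```python
def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-token PPO-clip surrogate and its gradient w.r.t. the token
    log-probability under the new policy.

    ratio = pi_new(token) / pi_old(token) = exp(logp_new - logp_old), so
    d(ratio)/d(logp_new) = ratio: the unclipped branch has gradient
    ratio * adv, while the clipped branch is constant and contributes 0.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    unclipped = ratio * adv
    clipped = clipped_ratio * adv
    if unclipped <= clipped:      # min(...) selects the unclipped branch
        return unclipped, ratio * adv
    return clipped, 0.0           # clipped branch active: zero gradient

# A 'surprising' good step (adv > 0) with a large policy shift is clipped:
print(ppo_clip_objective(1.5, 1.0))    # (1.2, 0.0) -> no learning signal
# A low-probability bad step (adv < 0) is clipped too:
print(ppo_clip_objective(0.5, -1.0))   # (-0.8, 0.0) -> no 'don't do that'
```

In both printed cases the surrogate's value is finite but its gradient is exactly zero, which is the training pathology GPPO targets.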
Key Novelty
Gradient-Preserving Clipping Policy Optimization (GPPO)
  • Instead of zeroing out gradients when the ratio between new and old policies deviates too far (clipping), GPPO keeps the gradients active but bounds them within a safe range
  • For positive advantages (good actions), it allows updates even if the ratio is high, preserving exploration signals from 'surprising' good moves
  • For negative advantages (bad actions), it ensures gradients flow back even if the ratio is low, allowing the model to quickly learn 'what not to do' without waiting for repeated sampling
Evaluation Highlights
  • 90.5% accuracy on AIME 2024 benchmark with Klear-Reasoner-8B
  • 83.2% accuracy on AIME 2025 benchmark
  • 66.0% pass rate on LiveCodeBench V5 (coding tasks)
Breakthrough Assessment
8/10
Achieves state-of-the-art reasoning performance for 8B models, reportedly outperforming DeepSeek-R1-Distill-8B on difficult benchmarks like AIME via a principled improvement to the RL objective.