GRPO: Group Relative Policy Optimization—a reinforcement learning method that estimates baselines from the average score of a group of completions rather than using a separate critic model
PPO: Proximal Policy Optimization—a standard RL algorithm that uses a clipped surrogate objective to ensure stable policy updates
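The clipped surrogate objective above can be sketched in a few lines. This is a minimal illustration of the clipping idea for a single action, not a full PPO implementation; the function name and the default epsilon of 0.2 are illustrative choices, not from the source.

```python
def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO surrogate for one action: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); clipping keeps the update from
    moving the policy too far from the old policy in a single step.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that the `min` makes the objective pessimistic: a large ratio cannot inflate the objective beyond the clipped value, which is what stabilizes the update.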
Advantage: A measure of how much better a specific action (completion) is compared to the average baseline performance
Completion Pruning: The process of discarding generated responses that have low information value (low absolute advantage) before performing expensive gradient computations
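The three entries above (GRPO's group-relative baseline, the advantage, and completion pruning) fit together as one pipeline: score a group of completions, normalize each reward against the group mean, then drop completions whose advantage is near zero. A minimal sketch, assuming standardized advantages and a hypothetical pruning threshold of 0.1 (neither value is specified in the source):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: (reward - group mean) / group std.

    The group mean replaces the learned critic baseline used in PPO.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero when all rewards match
    return [(r - mean) / std for r in rewards]

def prune_completions(completions: list[str], advantages: list[float],
                      threshold: float = 0.1) -> list[tuple[str, float]]:
    """Keep only completions with |advantage| >= threshold, so gradient
    computation is spent on informative samples."""
    return [(c, a) for c, a in zip(completions, advantages) if abs(a) >= threshold]
```

A side effect worth noting: if every completion in a group gets the same reward, all advantages are zero and the whole group is pruned, since such a group carries no learning signal.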
Bucket Effect: A phenomenon in parallel computing where the overall speed is limited by the device processing the largest workload (the 'slowest' bucket)
Pass@1: The probability that a model generates a correct answer on its first attempt
Chain of Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
KL divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from a reference distribution
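For discrete distributions, the divergence defined above can be computed directly from its formula, KL(P || Q) = sum_i p_i * log(p_i / q_i). A minimal sketch for finite probability vectors (the zero-probability handling via the `if pi > 0` guard follows the standard convention that 0 * log 0 = 0):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) for discrete distributions given as aligned probability lists.

    Asymmetric: KL(P || Q) != KL(Q || P) in general, and it is zero
    only when the two distributions are identical.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In RL fine-tuning, a KL term of this kind is typically used as a penalty that keeps the updated policy close to a reference model.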