CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR), Mathematical Reasoning

CLPO improves LLM reasoning by using the model's own rollout performance to dynamically build a curriculum, simplifying hard problems and diversifying medium ones for targeted self-improvement.

Core Problem

Standard RLVR methods sample training data uniformly, ignoring that models have already mastered some problems while finding others intractable, leading to inefficient exploration and limited learning.

Why it matters:

Uniform sampling wastes compute on easy problems the model has already mastered
The model struggles to learn from hard problems that are far beyond its current capabilities without guidance
Existing solutions often rely on expensive external teachers (e.g., GPT-4o) or static datasets, lacking efficient endogenous self-evolution

Concrete Example: A model might repeatedly solve 2+2 correctly (zero learning gain) while failing to solve a complex calculus problem (zero reward signal). Without intervention, it cannot bridge the gap to the hard problem or move on from the easy one.

Key Novelty

Curriculum-guided Learning for Policy Optimization (CLPO)

Treats the RL rollout phase as a diagnostic tool to measure problem difficulty in real-time based on the model's own success rate
Uses this difficulty signal to let the model act as its own teacher: it simplifies hard problems to make them learnable and diversifies medium problems to boost generalization
Adjusts the optimization objective dynamically, applying weaker constraints (lower KL penalty) on hard problems to encourage more aggressive exploration

Architecture

The CLPO framework workflow: Rollout -> Difficulty Assessment -> Adaptive Restructuring -> Policy Optimization.

Evaluation Highlights

Achieves a +6.96-point average pass@1 improvement over baselines across 8 benchmarks using Qwen3-8B (39.86 to 46.82)
Sets new SOTA pass@1 on the challenging AIME 2024 benchmark, outperforming strong baselines by 3.3 points (13.3 to 16.6)
Outperforms Critique-GRPO (which uses GPT-4o feedback) on multiple datasets without needing any external teacher

Breakthrough Assessment

8/10

A strong methodological contribution that integrates curriculum learning into the RLVR loop. It demonstrates significant gains on hard benchmarks without external dependencies, addressing a key efficiency bottleneck in self-improving training.

⚙️ Technical Details

Problem Definition

Setting: Online Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning tasks

Inputs: Reasoning problems q (e.g., math questions)

Outputs: Generated reasoning paths and final answers y

Pipeline Flow

Rollout & Diagnosis: Generate G responses per question → Calculate accuracy
Curriculum Construction: Classify questions as Hard, Medium, or Easy
Adaptive Problem Restructuring: Simplify Hard questions, Diversify Medium questions
Filtering: Select high-value samples (accuracy strictly between 0 and 1)
Optimization: Update policy with difficulty-aware KL penalties
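
The first three pipeline steps can be illustrated with a minimal sketch. It assumes the thresholds (0.3, 0.7) from the hyperparameter list are applied to rollout accuracy, that boundary values fall into the harder bucket, and that easy problems are left unchanged. The function names are hypothetical and are not taken from the paper.

```python
from typing import List

# Thresholds from the hyperparameter list; inclusive boundaries (<=) are an
# assumption of this sketch, not something the summary specifies.
TAU_HARD, TAU_MED = 0.3, 0.7


def empirical_accuracy(rewards: List[int]) -> float:
    """Fraction of the G rollouts that the verifier marked correct (reward 1)."""
    return sum(rewards) / len(rewards)


def classify(accuracy: float) -> str:
    """Map rollout accuracy to a difficulty bucket."""
    if accuracy <= TAU_HARD:
        return "hard"
    if accuracy <= TAU_MED:
        return "medium"
    return "easy"


def restructuring_action(difficulty: str) -> str:
    """Hard problems are simplified, medium ones diversified.
    Leaving easy problems untouched is an assumption."""
    return {"hard": "simplify", "medium": "diversify", "easy": "none"}[difficulty]


# Toy verifier outcomes for G = 4 rollouts per question (hypothetical data).
rollouts = {"q1": [0, 0, 0, 1], "q2": [1, 0, 1, 0], "q3": [1, 1, 1, 1]}
for qid, rewards in rollouts.items():
    acc = empirical_accuracy(rewards)
    level = classify(acc)
    print(qid, acc, level, restructuring_action(level))
```

With G = 4, accuracy can only take the values 0, 0.25, 0.5, 0.75, and 1, so the choice of boundary handling does not change the bucket of any question in this setting.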

System Modules

Policy Model

Generate reasoning paths and answers; updated during training

Model or implementation: Qwen3-8B

Verifier

Check correctness of generated answers to compute rewards and empirical accuracy

Model or implementation: Deterministic rule-based checker

Restructuring Module

Modify problem text based on difficulty classification

Model or implementation: Current Policy Model (Self-prompted)

Novel Architectural Elements

Integration of an online difficulty assessment loop directly into the RL data generation phase
Self-contained restructuring mechanism where the policy model modifies its own training data on-the-fly based on difficulty

Modeling

Base Model: Qwen3-8B

Training Method: CLPO (Curriculum-guided Learning for Policy Optimization), building on GRPO

Objective Functions:

Purpose: Optimize policy to maximize group-relative rewards while staying close to reference policy.

Formally: Maximize E [ min( ratio * A, clip(ratio) * A ) - beta * D_KL(pi || pi_ref) ]

Purpose: Adjust KL penalty dynamically based on problem difficulty.

Formally: beta = lambda_hard if problem is hard, else lambda_non-hard
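
The two formulas above combine into a per-sample objective, sketched below. The clip range `eps = 0.2` is an assumption (the summary writes only `clip(ratio)`), the KL term is taken as a precomputed number, and the lambda values come from the hyperparameter list. This is an illustration of the stated formulas, not the authors' implementation.

```python
def kl_coefficient(is_hard: bool,
                   lambda_hard: float = 0.3,
                   lambda_non_hard: float = 1.0) -> float:
    """beta = lambda_hard for hard problems, lambda_non_hard otherwise."""
    return lambda_hard if is_hard else lambda_non_hard


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A); eps is assumed."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def per_sample_objective(ratio: float, advantage: float, kl: float,
                         is_hard: bool, eps: float = 0.2) -> float:
    """Clipped surrogate minus the difficulty-dependent KL penalty."""
    beta = kl_coefficient(is_hard)
    return clipped_surrogate(ratio, advantage, eps) - beta * kl


# Same rollout statistics, different difficulty label (hypothetical numbers).
hard_value = per_sample_objective(ratio=1.5, advantage=1.0, kl=0.1, is_hard=True)
other_value = per_sample_objective(ratio=1.5, advantage=1.0, kl=0.1, is_hard=False)
print(hard_value, other_value)
```

For the same ratio, advantage, and KL value, the hard-problem objective is penalized less, which is what lets the policy move further from the reference on hard problems.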

Training Data:

DAPO-Math-17k dataset
Dynamic batch construction: B_mix = B_base U B_restructure
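
A minimal sketch of the batch construction follows. It assumes that the filter from the pipeline (accuracy strictly between 0 and 1) is applied to the union, and that restructured problems carry their own rollout rewards. The `Sample` type and `build_mixed_batch` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    question: str
    rewards: List[int]          # verifier outcomes for the G rollouts
    restructured: bool = False  # True if produced by the restructuring step


def accuracy(sample: Sample) -> float:
    return sum(sample.rewards) / len(sample.rewards)


def build_mixed_batch(base: List[Sample], restructured: List[Sample]) -> List[Sample]:
    """B_mix = B_base U B_restructure, keeping only samples whose accuracy is
    strictly between 0 and 1 (applying the filter here is an assumption)."""
    mixed = list(base) + list(restructured)
    return [s for s in mixed if 0.0 < accuracy(s) < 1.0]


# Hypothetical toy data.
base = [Sample("a", [1, 1, 1, 1]), Sample("b", [1, 0, 0, 0])]
restructured = [Sample("c", [0, 0, 0, 0], True), Sample("d", [1, 1, 0, 0], True)]
print([s.question for s in build_mixed_batch(base, restructured)])
```

In this toy batch, "a" (always solved) and "c" (never solved) are dropped, while "b" and "d" are kept.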

Key Hyperparameters:

learning_rate: 1e-6
rollout_n: 4
decoding_temperature: 1.0
max_response_length: 8192
difficulty_thresholds: (0.3, 0.7)
lambda_hard: 0.3
lambda_non_hard: 1.0

Compute: 8x H20 GPUs

Comparison to Prior Work

vs. GRPO: Adds dynamic curriculum and problem restructuring; significantly outperforms on hard tasks
vs. Critique-GRPO: Achieves superior or comparable performance without expensive external critique models
vs. AdaRFT [not cited in paper]: AdaRFT adjusts difficulty via rejection sampling/filtering, whereas CLPO actively rewrites/restructures the problems themselves

Limitations

Relies on the model's own capability to restructure problems; if the model is too weak to simplify a hard problem effectively, the loop may fail
Adds computational overhead during rollout for the restructuring generation step
Difficulty thresholds (tau_hard, tau_med) are hyperparameters that may need tuning for different domains

Reproducibility

Code: https://csuking.github.io/CLPO.github.io/

Code and training scripts are to be open-sourced at https://csuking.github.io/CLPO.github.io/. Hyperparameters and prompts are provided in the paper and its appendix. Training uses public datasets.

📊 Experiments & Results

Evaluation Setup

Mathematical and General Reasoning tasks

Benchmarks:

MATH-500 (Mathematical Reasoning)
AIME 2024 (Olympiad-level Math)
GPQA Diamond (Graduate-Level Scientific Reasoning)
TheoremQA (STEM Theorem Application)
MMLU Pro (Multi-task Language Understanding)
Olympiad Bench (Olympiad Math)
AMC23 (Math Competition)
Minerva-Math (Mathematical Reasoning)

Metrics:

pass@1
Statistical methodology: Not explicitly reported in the paper

Key Results

CLPO demonstrates consistent improvements across all evaluated benchmarks compared to strong baselines like GRPO and DAPO.

Benchmark	Metric	Baseline	This Paper	Δ
AIME 2024	pass@1	13.3	16.6	+3.3
MATH-500	pass@1	59.2	63.2	+4.0
Olympiad Bench	pass@1	26.3	31.9	+5.6
Average (8 benchmarks)	pass@1	39.86	46.82	+6.96

Ablation studies confirm the necessity of both simplification and diversification strategies.

Benchmark	Metric	Baseline	This Paper	Δ
AIME 2024	Avg@32	23.5	24.8	+1.3
AIME 2024	Avg@32	23.0	24.8	+1.8

Experiment Figures

Ablation of restructuring strategies (Hard-only, Medium-only, Both) on AIME24.

Impact of different KL penalty scaling factors (alpha) for hard problems.

Main Takeaways

Dynamic curriculum learning is superior to uniform sampling for reasoning tasks, preventing wasted effort on mastered tasks and neglect of hard ones.
Self-directed problem restructuring (Simplification for hard, Diversification for medium) effectively replaces external teacher guidance.
Difficulty-aware KL regularization (weaker constraint for hard problems) is crucial for balancing exploration of new strategies vs. exploitation of known solutions.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Curriculum Learning concepts
KL Divergence
Large Language Models (LLMs)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—a paradigm where models learn from outcomes (correct/incorrect) rather than human preference labels

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average of multiple samples for the same prompt
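
The group-relative advantage can be sketched as follows. Dividing by the group standard deviation follows the common GRPO formulation and is an assumption here, since the definition above mentions only the group average. The small `eps` guards against division by zero and is also an assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each rollout relative to the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct rollout out of four: the correct one gets a positive advantage.
print(group_relative_advantages([1, 0, 0, 0]))
# All rollouts correct: every advantage is zero, so there is no learning signal.
print(group_relative_advantages([1, 1, 1, 1]))
```

The second case shows why questions with accuracy exactly 0 or 1 contribute nothing to the update under this estimator: every rollout in the group receives the same reward, so all advantages vanish.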

Online Curriculum Learning: A method where the difficulty of training data is assessed and adjusted in real-time based on the model's current performance

Adaptive Problem Restructuring: The process of modifying training problems (simplifying or diversifying) based on their assessed difficulty to improve learning utility

pass@1: The percentage of problems where the model generates the correct answer on its first attempt

KL regularization: A penalty term in the loss function that prevents the model's policy from drifting too far from a reference policy (usually the initial model)

Rollout: The process of the model generating sequences (reasoning paths and answers) for a given set of prompts during RL training

Verifier: A deterministic function or system that checks if the model's final answer matches the ground truth