AdaRFT accelerates mathematical reasoning training by dynamically adjusting the difficulty of training problems to maintain a 50% model success rate, ensuring samples are neither too easy nor too hard.
Core Problem
Standard Reinforcement Finetuning is compute-inefficient because models waste resources training on problems they can already solve easily or problems that are currently impossible for them.
Why it matters:
RFT is computationally expensive due to repeated rollout generation and reward computation
Static datasets or fixed curricula fail to adapt to the model's evolving capabilities, leading to suboptimal learning rates
Existing filtering methods (removing easy/hard samples) are rigid, require repeated expensive rollouts, or rely on brittle manual thresholds
Concrete Example: A model capable of solving high school algebra might waste thousands of training steps iterating on elementary arithmetic (zero learning signal, since it always succeeds) or International Math Olympiad problems (zero reward, since it always fails), resulting in slow convergence.
Core Mechanism
Maintains a 'target difficulty' scalar that represents the ideal problem difficulty for the model's current skill level
Updates this target dynamically using a feedback loop: if recent rewards are high, increase target difficulty; if low, decrease it
Samples training batches by selecting problems from the dataset whose estimated difficulty scores are closest to the current target
Architecture
Pseudocode for AdaRFT showing the loop of difficulty estimation, batch selection, PPO update, and target difficulty adjustment.
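The loop in that figure can be sketched end to end as a runnable toy. Everything below is illustrative (the hyperparameter values, the mock reward model, and all variable names are assumptions, not the paper's code); a real implementation would replace the mock reward with actual PPO rollouts and a PPO policy update.

```python
import numpy as np

rng = np.random.default_rng(0)

difficulties = rng.uniform(0.0, 1.0, size=10_000)  # precomputed scores d_i per problem
T = 0.2                              # current target difficulty
eta, alpha, beta = 0.05, 2.0, 0.5    # step size, sensitivity, target success rate (0.5)
batch_size = 256
skill = 0.3                          # hidden "model skill" driving the mock reward below

for step in range(200):
    # 1) Batch selection: problems whose estimated difficulty is closest to T.
    idx = np.argsort(np.abs(difficulties - T))[:batch_size]
    batch = difficulties[idx]

    # 2) Mock rollouts + binary rewards: success is likelier when the
    #    problem is easier than the model's current skill.
    success_prob = 1.0 / (1.0 + np.exp(10.0 * (batch - skill)))
    rewards = (rng.uniform(size=batch_size) < success_prob).astype(float)
    R_avg = rewards.mean()

    # (A real run performs a PPO update here; we just let skill creep upward.)
    skill = min(1.0, skill + 0.002)

    # 3) Target-difficulty update: push T up when rewards exceed beta, down otherwise.
    T = float(np.clip(T + eta * np.tanh(alpha * (R_avg - beta)), 0.0, 1.0))
```

With these toy dynamics, T climbs from 0.2 and then tracks the rising skill level, staying near the difficulty where the mock success rate is about 50%.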
Evaluation Highlights
Reduces training time by up to 2× compared to standard baselines (Abstract)
Demonstrates a clear negative correlation (Pearson r = -0.34) between estimated difficulty scores and ground-truth model success rates, as expected (harder problems are solved less often)
Achieves robust difficulty estimation with as few as 64 rollouts (estimates remain within ±0.05 of ground truth >90% of the time)
Breakthrough Assessment
7/10
A theoretically grounded, lightweight improvement to standard RFT that addresses a major efficiency bottleneck. While the core idea of 'train on the frontier' is classic, the adaptive implementation for LLMs is practical and effective.
⚙️ Technical Details
Problem Definition
Setting: Reinforcement Finetuning of a policy model on a dataset of mathematical problems with binary correctness rewards
Inputs: Dataset of problems D with difficulty scores d_i, initial policy π_θ
Outputs: Optimized policy π_θ capable of solving mathematical reasoning tasks
Pipeline Flow
Policy Model (Generates solution)
Reward Mechanism (Checks correctness)
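The reward mechanism is binary correctness. A minimal sketch of such a checker, assuming the common convention that math benchmarks put the final answer in a \boxed{...} span (the paper's exact parser is not specified here, so this format and both function names are assumptions):

```python
import re
from typing import Optional

def extract_answer(solution: str) -> Optional[str]:
    """Pull the final answer out of the last \\boxed{...} span (assumed format)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def binary_reward(solution: str, gold: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    ans = extract_answer(solution)
    return 1.0 if ans is not None and ans == gold.strip() else 0.0
```

A production checker would normalize answers (fractions, LaTeX formatting) before comparing, but the binary 0/1 reward shape is what AdaRFT's average-reward feedback consumes.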
System Modules
Policy Model
Generate mathematical solutions given a problem statement
Model or implementation: Qwen 2.5 (7B or 1.5B variants)
Modeling
Base Model: Qwen 2.5 7B and Qwen 2.5 MATH 1.5B
Training Method: AdaRFT (wrapping PPO)
Objective Functions:
Purpose: Dynamically update the target difficulty T to maintain a specific success rate.
Formally: T ← clip(T + η · tanh(α(R_avg - β)))
Purpose: Standard PPO policy update.
Formally: Standard clipped surrogate objective L_CLIP
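The target-difficulty rule can be checked with a one-step numeric example; eta, alpha, and the clip range below are illustrative values, not the paper's settings:

```python
import numpy as np

def update_target(T, R_avg, eta=0.05, alpha=2.0, beta=0.5, lo=0.0, hi=1.0):
    """One step of T <- clip(T + eta * tanh(alpha * (R_avg - beta))).

    tanh squashes the feedback so a single lucky or unlucky batch
    cannot move the target far; clip keeps T in the valid score range.
    """
    return float(np.clip(T + eta * np.tanh(alpha * (R_avg - beta)), lo, hi))

# Rewards above the 50% target (beta) push difficulty up; below, down.
harder = update_target(T=0.40, R_avg=0.80)   # R_avg > beta -> T increases
easier = update_target(T=0.40, R_avg=0.20)   # R_avg < beta -> T decreases
```

Because tanh is odd, equal reward deviations above and below beta move T by equal and opposite amounts (until clipping engages).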
Training Data:
DeepScaleR dataset (AIME, AMC, Omni-MATH, Still)
Difficulty estimation using Qwen 2.5 MATH 7B (pass@128)
Data split into three distributions: skew-difficult, skew-easy, and uniform (10,000 samples each)
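One natural reading of the pass@128 scoring is d_i = 1 - (empirical solve rate over 128 rollouts of the reference model). A synthetic sketch of that estimator, with the rollouts simulated rather than sampled from a real model:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_difficulty(true_solve_prob: float, k: int = 128) -> float:
    """Difficulty as 1 minus the empirical pass rate over k rollouts.

    Here each 'rollout' is a Bernoulli draw standing in for a real
    reference-model generation plus correctness check.
    """
    correct = rng.uniform(size=k) < true_solve_prob
    return 1.0 - float(correct.mean())

d_easy = estimate_difficulty(0.9)   # expected near 0.1
d_hard = estimate_difficulty(0.1)   # expected near 0.9
```

With k = 128 Bernoulli samples, the standard error of the estimate is at most sqrt(0.25/128) ≈ 0.044, consistent with the paper's observation that even 64 rollouts usually land within ±0.05 of the 128-rollout value.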
Key Hyperparameters:
target_reward_beta: 0.5 (aiming for 50% success rate)
samples_for_difficulty_est: 128 (rollouts per problem)
Compute: 8 A100 GPUs for ~100 steps
Comparison to Prior Work
vs. Fixed Curriculum (Staged): AdaRFT adjusts difficulty continuously based on real-time reward feedback rather than pre-defined stages.
vs. Online Filtering (Bae et al.): AdaRFT uses a lightweight scalar update rule instead of expensive repeated rollouts to find suitable problems.
vs. SPL (Self-Paced Learning) [not cited in paper]: AdaRFT targets a specific success rate (0.5) rather than simply minimizing loss, which maximizes reward variance (and hence the learning signal) for RL.
Limitations
Relies on accurate initial difficulty estimation (requires rollouts or GPT-4 scoring).
Requires a diverse dataset with sufficient coverage of the difficulty spectrum.
Hyperparameters (alpha, eta) may need tuning for different dataset scales or reward distributions.
Code is publicly available at github.com/limenlp/verl. Dataset with difficulty scores is available on HuggingFace. Difficulty estimation method (pass@128) is fully described.
📊 Experiments & Results
Evaluation Setup
Mathematical reasoning on competition-level problems
Benchmarks:
MATH 500 (General Math)
AMC 23 (Competition Math)
AIME 24 (Advanced Competition Math)
OlympiadBench (Olympiad Math/Physics)
Metrics:
Accuracy (pass@1)
Training time / Efficiency
Statistical methodology: Pearson correlation reported for difficulty estimation analysis.
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MATH dataset | Pearson correlation (r) | 0 | -0.34 | -0.34 |
| MATH dataset | Solve rate (%) | 52.7 | 86.0 | +33.3 |
| Training efficiency | Training time (normalized) | 1.0 | 0.5 | -0.5 (2× faster) |
Experiment Figures
Validation of the difficulty estimation method. (a) Confidence of estimation vs number of samples. (b) Average solve rate vs AoPS difficulty levels.
Main Takeaways
Targeting a 50% success rate is theoretically and empirically optimal for RFT efficiency: with binary rewards, per-problem reward variance p(1 - p), and hence the learning signal, is maximized at p = 0.5.
Difficulty estimation is robust: 64 rollouts provide difficulty estimates within ±0.05 of the 'ground truth' (128 rollouts) over 90% of the time.
AdaRFT effectively handles imbalanced data distributions (skew-easy or skew-hard) where static baselines fail to find learning signal.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (PPO)
Curriculum Learning
Large Language Models (LLMs) training
Key Terms
RFT: Reinforcement Finetuning—using reinforcement learning (like PPO) to optimize a pre-trained model for a specific task using reward signals
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm that updates the model policy while preventing destructively large updates via clipping
Curriculum Learning: A training strategy where the model is presented with easier examples first, gradually increasing difficulty, mimicking human education
Pass@k: A metric measuring the probability that at least one correct solution is generated out of k attempts
Rollout: A complete generation sequence produced by the model (action) in response to a prompt (state)
AoPS: Art of Problem Solving—a math education platform whose difficulty classification system is used as a ground truth baseline in this paper
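For reference, the standard unbiased pass@k estimator from the code/math evaluation literature (general background, not code from this paper): given n samples of which c are correct, it gives the probability that a random size-k subset contains at least one correct sample.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn, c: number correct, k: subset size.
    """
    if n - c < k:
        # Too few incorrect samples to fill a size-k subset:
        # every subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 128 rollouts and c = 64 correct, pass@1 is exactly 0.5, which is the success rate AdaRFT's difficulty targeting aims to maintain.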