MAB: Multi-Armed Bandit—a problem where an agent must choose between multiple options (arms) to maximize reward, balancing exploration and exploitation
TD(0): Temporal Difference learning with a one-step lookahead (the λ=0 case of TD(λ))—an update rule that moves the value estimate toward the immediate reward plus the discounted estimate of the next state
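As a minimal sketch of the TD(0) update just described (function and variable names are ours, not from the source):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[s] toward r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]  # bootstrapped target minus current estimate
    V[s] += alpha * td_error
    return V

V = {"a": 0.0, "b": 1.0}
td0_update(V, "a", r=0.5, s_next="b")  # V["a"] becomes ~0.149
```

Note that the target bootstraps off the current estimate `V[s_next]` rather than a full return, which is what distinguishes TD(0) from Monte Carlo updates.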
absolute advantage: The absolute value of the advantage function |A_t|; used here as a proxy for learning gain because it scales the gradient norm in policy gradient methods
GRPO: Group Relative Policy Optimization—an RL algorithm used for reasoning tasks that normalizes rewards within a group of outputs
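The within-group normalization can be sketched as follows (a simplified illustration, not the full GRPO objective; whether the population or sample standard deviation is used varies by implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward relative to its group: (r - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]

# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Each output's advantage is thus defined only by how it compares to the other outputs in its group, with no learned value function.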
OOD: Out-of-Distribution—test data that differs significantly from training data (e.g., harder difficulty levels)
RLOO: REINFORCE Leave-One-Out—a policy gradient estimator that, for each sampled output, uses the mean reward of the other samples in the group as a baseline to reduce variance
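The leave-one-out baseline amounts to a one-line computation (a sketch under our own naming; only the advantage term is shown, not the full REINFORCE gradient):

```python
def rloo_advantages(rewards):
    """For each sample, subtract the mean reward of the *other* samples."""
    n = len(rewards)
    total = sum(rewards)
    # (total - r) / (n - 1) is the leave-one-out mean for sample with reward r
    return [r - (total - r) / (n - 1) for r in rewards]

advs = rloo_advantages([1.0, 0.0, 0.0, 1.0])  # each correct sample gets +2/3, each incorrect -2/3
```

Because each sample's baseline excludes its own reward, the baseline is independent of that sample and the estimator stays unbiased.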
PPO: Proximal Policy Optimization—an RL algorithm that updates policies with a clipped objective to ensure stability
POMDP: Partially Observable Markov Decision Process—a framework for decision making where the system state is not fully visible
pass@1: The probability that a single model generation is correct
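In practice pass@1 is estimated empirically as the fraction of single generations that are correct, as in this sketch (names are ours):

```python
def pass_at_1(results):
    """Empirical pass@1: fraction of generations graded correct."""
    return sum(results) / len(results)

rate = pass_at_1([True, False, True, True])  # 3 of 4 correct -> 0.75
```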