CRL: Curriculum Reinforcement Learning—training an agent on a sequence of tasks ordered by increasing difficulty.
API: Approximate Policy Iteration—a theoretical framework for analyzing RL algorithms that alternate between estimating value functions and updating policies.
Gaussian Scheduling: A proposed task-sampling method where the probability of selecting a task difficulty follows a Gaussian distribution whose mean shifts from easy to hard as training progresses.
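The Gaussian scheduling idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the linear mean shift, and the fixed `sigma` width are assumptions made for the example.

```python
import numpy as np

def gaussian_schedule(num_levels, step, total_steps, sigma=1.0):
    """Sampling distribution over difficulty levels 0..num_levels-1.

    The Gaussian mean moves linearly from the easiest level (0) to the
    hardest (num_levels - 1) as training progresses; sigma controls how
    concentrated sampling is around the current target difficulty.
    (Hypothetical sketch: the linear schedule and fixed sigma are
    assumptions, not taken from the source.)
    """
    mean = (step / total_steps) * (num_levels - 1)
    levels = np.arange(num_levels)
    weights = np.exp(-0.5 * ((levels - mean) / sigma) ** 2)
    return weights / weights.sum()

# Sample a task difficulty at a given training step.
rng = np.random.default_rng(0)
probs = gaussian_schedule(num_levels=5, step=50, total_steps=100)
task_difficulty = rng.choice(5, p=probs)
```

Early in training the distribution concentrates on easy levels; by the final step it concentrates on the hardest, with `sigma` trading off focus against exploration of neighboring difficulties.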
SFT: Supervised Fine-Tuning—training a model to imitate fixed input-output examples.
MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker.
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
Reward Hacking: When a model learns to exploit flaws in the reward function (e.g., giving short, trivial answers) rather than solving the actual task.
DeepSeek-R1: A recent family of reasoning models trained via reinforcement learning that served as inspiration for this work.