Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

📝 Paper Summary

LLM Reasoning
Reinforcement Learning from Human Feedback (RLHF)

R³ improves LLM reasoning using only sparse outcome supervision. It trains the model to complete correct demonstrations from start points that begin near the solution and move progressively back to the problem statement, which creates a curriculum of increasing difficulty.

Core Problem

Reinforcement learning for complex reasoning faces a dilemma: Outcome Supervision (OS) is cheap but provides sparse rewards that fail to guide long reasoning chains, while Process Supervision (PS) offers dense guidance but requires expensive, expert step-by-step annotations.

Why it matters:

LLMs struggle to optimize long reasoning chains because errors accumulate, and sparse rewards (only at the end) make it hard to identify which specific intermediate step caused failure
Gathering step-by-step human annotations for Process Supervision is prohibitively expensive and hard to scale compared to just collecting final answers
Existing RL methods often result in aimless exploration when the search space is large

Concrete Example: In a math problem requiring 5 reasoning steps, a standard RL policy that generates every step from scratch rarely reaches the correct final answer, so nearly every rollout receives reward 0 and provides almost no learning signal. R³ first starts the model at step 4, with the ground-truth steps 1-3 given, so a correct answer and its reward are easy to reach. It then gradually moves the start point back to step 3, step 2, and finally the bare question.
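To put rough numbers on this example, here is a toy calculation. It assumes, purely for illustration, that each reasoning step succeeds independently with the same probability; this independence model, the function name `success_probability`, and the value 0.5 are not from the paper.

```python
def success_probability(p_step: float, steps_remaining: int) -> float:
    """Chance that every remaining step is correct under the toy independence model."""
    return p_step ** steps_remaining


P_STEP = 0.5      # hypothetical per-step success rate (assumption)
TOTAL_STEPS = 5   # matches the 5-step example in the text

# Walk the start point backward: first many steps are given, finally none.
for steps_given in range(TOTAL_STEPS - 1, -1, -1):
    remaining = TOTAL_STEPS - steps_given
    prob = success_probability(P_STEP, remaining)
    print(f"{steps_given} steps given, {remaining} to generate: P(reward) = {prob:.3f}")
```

Under these assumptions a from-scratch rollout is rewarded about 3% of the time (0.5^5 = 0.03125), while a rollout that only has to produce the last two steps is rewarded 25% of the time (0.5^2), which is why the late start points yield a usable signal much sooner.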

Key Novelty

Reverse Curriculum Reinforcement Learning (R³)

Instead of generating the full reasoning chain from scratch, the model starts training from intermediate states sampled from a correct demonstration
The starting point progressively slides from the end of the demonstration (near the solution) to the beginning (the original question)
This creates a curriculum where the model first solves 'easy' short-horizon completions before attempting full-horizon reasoning, providing dense-like signals using only outcome rewards
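The sliding start point described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `reverse_curriculum_prompts`, the newline-joined prompt format, and the sample question are all assumptions.

```python
from typing import List


def reverse_curriculum_prompts(question: str, demo_steps: List[str]) -> List[str]:
    """Return start states ordered from easiest (most steps given) to hardest (question only)."""
    prompts = []
    # k is the number of demonstration steps handed to the model as a prefix.
    # The full demonstration is never given, so the model always generates something.
    for k in range(len(demo_steps) - 1, -1, -1):
        prompts.append("\n".join([question] + demo_steps[:k]))
    return prompts


# Hypothetical demonstration, used only to show the shape of the data.
question = "Q: Tom has 3 boxes of 4 apples and eats 2. How many are left?"
steps = [
    "Step 1: 3 * 4 = 12 apples in total.",
    "Step 2: 12 - 2 = 10 apples left.",
    "Step 3: The answer is 10.",
]

prompts = reverse_curriculum_prompts(question, steps)
for stage, prompt in enumerate(prompts, start=1):
    print(f"--- stage {stage} ---\n{prompt}")
```

The first prompt ends just before the final step, and the last prompt is the original question on its own, so the policy faces progressively longer completions.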

Architecture

Comparison of Outcome Supervision, Process Supervision, and the proposed R³ method.

Evaluation Highlights

Outperforms the RL baseline (standard PPO) by 4.1 points on average across eight reasoning tasks using Llama-2-7B
Surpasses Supervised Fine-Tuning (SFT) baseline by 11.4 points on average for program-based reasoning on GSM8K
CodeLlama-7B trained with R³ achieves comparable performance to much larger or closed-source models (like GPT-3.5-Turbo) on math tasks without using extra annotated data

Breakthrough Assessment

8/10

Elegantly solves the sparse reward problem in reasoning without expensive process labels. The 'reverse curriculum' analogy from robotics is successfully applied to LLMs with significant empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning for multi-step reasoning tasks

Inputs: Prompt/Question s_0 and a correct demonstration trajectory τ_demo = {s_0, a_1, ..., a_T}

Outputs: Reasoning chain leading to a final answer

Pipeline Flow

Demonstration Sampling (Sample intermediate states)
Curriculum Scheduler (Determine start state distribution)
PPO Training (Optimize policy from start state to end)

System Modules

Curriculum Sampler

Selects a start state s_k from the ground-truth demonstration τ_demo based on the current curriculum stage

Policy Model

Generates the completion of the reasoning chain starting from s_k

Model or implementation: Llama-2-7B / CodeLlama-7B / Llama-2-13B

Environment/Reward

Evaluates the final answer correctness

Novel Architectural Elements

Reverse Curriculum Scheduler: Dynamically adjusts the starting position of generation within the demonstration trajectory during RL training.
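One way such a scheduler could map training progress to a start position is sketched below. The even spacing of start points, the fixed number of updates per stage, and the names `prefix_length` and `stage_for_step` are assumptions for illustration; the summary does not specify the paper's exact rule. The sketch shows a purely sequential schedule, which corresponds to the "Vanilla staged RL" variant mentioned under Experiment Figures rather than the mixed-stage variant.

```python
def prefix_length(stage: int, num_stages: int, num_steps: int) -> int:
    """Number of demonstration steps given as a prefix at a 1-indexed stage.

    Stage 1 starts closest to the answer; the last stage starts from the bare question.
    """
    if not 1 <= stage <= num_stages:
        raise ValueError("stage must be in 1..num_stages")
    # Integer arithmetic keeps the result exact and monotonically decreasing.
    return (num_stages - stage) * num_steps // num_stages


def stage_for_step(train_step: int, steps_per_stage: int, num_stages: int) -> int:
    """Sequential schedule: advance one stage every `steps_per_stage` updates."""
    return min(train_step // steps_per_stage + 1, num_stages)


# A 5-step demonstration with 5 stages (hypothetical settings).
for step in (0, 100, 200, 300, 400):
    stage = stage_for_step(step, steps_per_stage=100, num_stages=5)
    print(step, stage, prefix_length(stage, num_stages=5, num_steps=5))
```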

Modeling

Base Model: Llama-2-7B, Llama-2-13B, CodeLlama-7B

Training Method: PPO (Proximal Policy Optimization) with Reverse Curriculum

Objective Functions:

Purpose: Maximize expected reward while staying close to the initial policy.

Formally: Standard PPO objective with KL penalty.

Purpose: Encourage numeric accuracy (soft reward).

Formally: Partial reward ε=0.1 for correct numeric extraction in math tasks.
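The two objectives above can be sketched together. This is an illustration under stated assumptions, not the paper's code: it reads the partial reward as "the final answer parses as a number but is wrong", uses a simple last-number regex for extraction, and applies the KL penalty per token in the way common RLHF-style PPO setups do. The function names and the sample log-probabilities are hypothetical.

```python
import re
from typing import List, Optional

EPSILON = 0.1  # partial reward value reported in the summary


def extract_number(text: str) -> Optional[float]:
    """Return the last number in the text, or None if there is none (assumed extraction rule)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def outcome_reward(completion: str, gold: float) -> float:
    """1.0 for a correct answer, EPSILON for a numeric but wrong answer, 0.0 otherwise."""
    predicted = extract_number(completion)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - gold) < 1e-6 else EPSILON


def kl_shaped_rewards(outcome: float, logp_policy: List[float],
                      logp_ref: List[float], beta: float) -> List[float]:
    """Per-token rewards: a KL penalty at every token, plus the outcome reward on the last one."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += outcome
    return rewards


print(outcome_reward("Step 2: 12 - 2 = 10. The answer is 10.", gold=10))
print(outcome_reward("The answer is 12.", gold=10))
print(outcome_reward("I am not sure.", gold=10))
print(kl_shaped_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1))
```

The partial reward keeps rollouts that at least produce a well-formed numeric answer distinguishable from rollouts that produce none, while the KL term discourages the policy from drifting far from the initial model.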

Key Hyperparameters:

partial_reward_epsilon: 0.1
kl_coefficient_beta: Values typically around 0.02-0.1 (exact value varying by stage/setup)
number_of_stages_M: 5 or 6

Compute: Not reported in the paper

Comparison to Prior Work

vs. Outcome Supervision: R³ uses the same sparse signal but creates a dense learning signal via curriculum, converging faster and to higher performance.
vs. Process Supervision: R³ achieves similar step-wise guidance benefits without requiring any step-level annotations.
vs. SFT: R³ lets the model explore and correct its own errors rather than only maximizing the likelihood of demonstrations, leading to better generalization.

Limitations

Relies on the availability of correct demonstrations to sample start states (requires SFT data).
The curriculum design (number of stages, mixing strategy) introduces additional hyperparameters to tune.
Requires computable/verifiable outcome rewards (e.g., math answers), making it harder to apply to open-ended generation tasks without ground truth.

Reproducibility

Code: https://github.com/WooooDyy/LLM-Reverse-Curriculum-RL

Code and data are publicly available at the repository above. The paper provides algorithm pseudocode and discusses hyperparameters.

📊 Experiments & Results

Evaluation Setup

Evaluated on 8 reasoning datasets covering math and commonsense reasoning.

Benchmarks:

GSM8K (Grade School Math)
MATH (Challenging Math Problems)
SVAMP (Math Word Problems)
MultiArith (Arithmetic Reasoning)

Metrics:

Accuracy

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison against the standard RL (Outcome Supervision) baseline across multiple tasks using Llama-2-7B.

Benchmark	Metric	Baseline	This Paper	Δ
Average (8 tasks)	Accuracy	45.0	49.1	+4.1
GSM8K	Accuracy	42.5	46.7	+4.2

Program-based reasoning results (generating code to solve math problems) on GSM8K.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	48.2	59.6	+11.4

Experiment Figures

Performance curves comparing 'Vanilla staged RL' (sequential stages) vs 'R³' (mixed stages).

Main Takeaways

R³ consistently outperforms both SFT and standard RL with outcome supervision across diverse reasoning tasks.
The method is particularly effective for program-based reasoning (math via code), showing large gains over SFT.
Mixing stages (training on both easy/late start states and hard/early start states simultaneously) is crucial for stability and preventing catastrophic forgetting of earlier curriculum stages.
The approach effectively bridges the gap between outcome and process supervision without requiring extra human annotation effort.
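The stage mixing mentioned in the takeaways could look roughly like the sketch below, which pools start states from every stage into one shuffled training set instead of visiting stages one after another. The uniform mix, the even spacing of start points, and the names `build_mixed_dataset` and `demos` are assumptions; the summary does not give the paper's exact mixing strategy.

```python
import random
from typing import List, Tuple

Demo = Tuple[str, List[str]]  # (question, ordered reasoning steps)


def build_mixed_dataset(demos: List[Demo], num_stages: int,
                        seed: int = 0) -> List[Tuple[int, str]]:
    """Return shuffled (stage, prompt) pairs covering every stage of every demonstration."""
    rng = random.Random(seed)
    examples = []
    for question, steps in demos:
        for stage in range(1, num_stages + 1):
            # Stage 1 keeps the most steps; the last stage keeps none.
            keep = (num_stages - stage) * len(steps) // num_stages
            examples.append((stage, "\n".join([question] + steps[:keep])))
    rng.shuffle(examples)  # easy and hard start states end up in the same batches
    return examples


# Hypothetical demonstrations, used only to show the shape of the output.
demos = [("Q1", ["a", "b", "c", "d"]), ("Q2", ["x", "y"])]
mixed = build_mixed_dataset(demos, num_stages=2)
for stage, prompt in mixed:
    print(stage, repr(prompt))
```

Because every batch can contain both late and early start states, the policy keeps practicing short completions while it learns full-horizon reasoning, which is the intuition behind the forgetting argument in the takeaway.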

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, Policy Gradient)
Language Modeling (Chain-of-Thought)
Curriculum Learning

Key Terms

R³: Reverse Curriculum Reinforcement Learning—the proposed method that trains models by sliding the start state from the end of a demonstration to the beginning.

Outcome Supervision: Providing a reward signal only based on the correctness of the final answer, without evaluating intermediate steps.

Process Supervision: Providing reward signals at each intermediate step of the reasoning chain, typically requiring human annotation.

PPO: Proximal Policy Optimization—a standard policy gradient method used for training.

SFT: Supervised Fine-Tuning—training the model to mimic demonstrations via maximum likelihood estimation.

Chain-of-Thought: A prompting or generation style where the model produces intermediate reasoning steps before the final answer.

Program-based reasoning: Generating executable code (e.g., Python) to solve reasoning problems rather than natural language text.