
Learning to Reason at the Frontier of Learnability

Thomas Foster, Jakob N. Foerster
University of Oxford
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning for LLMs · Curriculum Learning · Reasoning
LILO improves RL training efficiency for reasoning LLMs by prioritizing questions with high learnability (outcome variance), theoretically maximizing expected policy improvement.
Core Problem
Standard RL training for LLMs wastes significant compute on questions that are either too hard (always fail) or too easy (always succeed), yielding near-zero gradients.
Why it matters:
  • Training Large Language Models (LLMs) with Reinforcement Learning (RL) is extremely compute-intensive
  • Human effort is currently wasted manually curating datasets of appropriate difficulty levels for evolving models
  • Existing methods like PPO and GRPO produce zero gradients on examples where the success variance is zero, so no learning occurs despite the compute spent on generation
Concrete Example: If a model attempts a calculus problem 8 times and fails every time (reward 0), or attempts '1+1' and succeeds every time (reward 1), the variance is 0. Standard algorithms like RLOO compute an advantage of 0 for these cases, resulting in no model update despite the compute cost of generation.
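The zero-advantage effect described above can be sketched directly. This is a minimal illustration of the REINFORCE Leave-One-Out (RLOO) advantage, where each sample's baseline is the mean reward of the other samples in its group; the function name is ours, not from the paper:

```python
# Minimal sketch: why all-fail or all-succeed rollout groups yield zero RLOO advantages.
def rloo_advantages(rewards):
    """For each sample, advantage = reward minus the mean reward of the
    remaining samples (leave-one-out baseline)."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

print(rloo_advantages([0, 0, 0, 0, 0, 0, 0, 0]))  # 8 failures -> every advantage is 0.0
print(rloo_advantages([1, 1, 1, 1, 1, 1, 1, 1]))  # 8 successes -> every advantage is 0.0
print(rloo_advantages([1, 0, 0, 1]))              # mixed outcomes -> non-zero advantages
```

With zero advantages the policy-gradient update vanishes, so the eight generations are pure wasted compute; only the mixed-outcome group produces a learning signal.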
Key Novelty
Learnability-Prioritized Training (LILO)
  • Defines 'learnability' as the variance of success (reward) on a given question; for binary rewards with success probability p this is p(1−p), which peaks at p = 0.5 and vanishes when the model always fails or always succeeds
  • Theoretically proves that expected policy improvement scales linearly with this learnability metric for advantage-based RL algorithms
  • Uses rejection sampling to dynamically select a training batch of questions where the model currently has non-zero success variance (frontier of knowledge)
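The selection step in the last bullet can be sketched as follows. This is a hedged illustration, not the paper's implementation: function and variable names (`learnability`, `sample_learnable_batch`, `success_rate`) are ours, and we assume success rates are estimated from a handful of rollouts per question:

```python
import random

def learnability(p):
    """Bernoulli variance of the outcome reward: p * (1 - p), maximal at p = 0.5."""
    return p * (1.0 - p)

def sample_learnable_batch(questions, success_rate, batch_size, rng=None):
    """Rejection-sample a training batch: draw questions uniformly and keep
    only those with non-zero estimated outcome variance, i.e. questions on
    the model's current 'frontier of knowledge'.

    Assumes at least one question has non-zero learnability."""
    rng = rng or random.Random(0)
    batch = []
    while len(batch) < batch_size:
        q = rng.choice(questions)
        if learnability(success_rate[q]) > 0:
            batch.append(q)
    return batch

# Hypothetical success rates, e.g. estimated from 8 rollouts per question.
rates = {"q_easy": 1.0, "q_hard": 0.0, "q_mid": 0.5, "q_near": 0.875}
batch = sample_learnable_batch(list(rates), rates, batch_size=4)
print(batch)  # only 'q_mid' / 'q_near' can appear; the others are rejected
```

The always-fail and always-succeed questions are filtered out before any expensive generation is spent on them, which is the source of the training-step speedups reported below.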
Evaluation Highlights
  • Achieves 3.3x speedup in training steps to reach baseline accuracy using VinePPO on GSM8K
  • Improves final test accuracy by +2.7% on MATH dataset with PPO compared to standard uniform sampling
  • Increases accuracy on the large-scale ORZ57K dataset by +1.6% using GRPO with Qwen-2.5-1.5B
Breakthrough Assessment
8/10
Provides a strong theoretical foundation for curriculum learning in LLM RL and demonstrates significant efficiency gains (up to 3.3x speedup) across multiple standard algorithms and datasets.