
Learning to Reason at the Frontier of Learnability

Thomas Foster, Jakob N. Foerster
University of Oxford
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning for LLMs · Curriculum Learning · Reasoning
LILO improves RL training efficiency for reasoning LLMs by prioritizing questions with high learnability (outcome variance), theoretically maximizing expected policy improvement.
Core Problem
Standard RL training for LLMs wastes significant compute on questions that are either too hard (always fail) or too easy (always succeed), yielding near-zero gradients.
Why it matters:
  • Training Large Language Models (LLMs) with Reinforcement Learning (RL) is extremely compute-intensive
  • Human effort is currently wasted manually curating datasets of appropriate difficulty levels for evolving models
  • Existing methods like PPO and GRPO produce zero gradients on examples where the success variance is zero, so no learning occurs despite the compute spent on generation
Concrete Example: If a model attempts a calculus problem 8 times and fails every time (reward 0), or attempts '1+1' and succeeds every time (reward 1), the variance is 0. Standard algorithms like RLOO compute an advantage of 0 for these cases, resulting in no model update despite the compute cost of generation.
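The zero-advantage effect described above can be sketched directly. This is a minimal illustration of the REINFORCE Leave-One-Out (RLOO) advantage, where each sample's baseline is the mean reward of the other samples in its group; the function name is ours, not from the paper:

```python
# Minimal sketch: why all-fail or all-succeed rollout groups yield zero RLOO advantages.
def rloo_advantages(rewards):
    """For each sample, advantage = reward minus the mean reward of the
    remaining samples (leave-one-out baseline)."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

print(rloo_advantages([0, 0, 0, 0, 0, 0, 0, 0]))  # 8 failures -> every advantage is 0.0
print(rloo_advantages([1, 1, 1, 1, 1, 1, 1, 1]))  # 8 successes -> every advantage is 0.0
print(rloo_advantages([1, 0, 0, 1]))              # mixed outcomes -> non-zero advantages
```

With zero advantages the policy-gradient update vanishes, so the eight generations are pure wasted compute; only the mixed-outcome group produces a learning signal.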
Key Novelty
Learnability-Prioritized Training (LILO)
  • Defines 'learnability' as the variance of success (reward) on a given question; for binary rewards with success probability p this is p(1−p), which peaks at p = 0.5 and vanishes when the model always fails or always succeeds
  • Theoretically proves that expected policy improvement scales linearly with this learnability metric for advantage-based RL algorithms
  • Uses rejection sampling to dynamically select a training batch of questions where the model currently has non-zero success variance (frontier of knowledge)
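The selection step in the last bullet can be sketched as follows. This is a hedged illustration, not the paper's implementation: function and variable names (`learnability`, `sample_learnable_batch`, `success_rate`) are ours, and we assume success rates are estimated from a handful of rollouts per question:

```python
import random

def learnability(p):
    """Bernoulli variance of the outcome reward: p * (1 - p), maximal at p = 0.5."""
    return p * (1.0 - p)

def sample_learnable_batch(questions, success_rate, batch_size, rng=None):
    """Rejection-sample a training batch: draw questions uniformly and keep
    only those with non-zero estimated outcome variance, i.e. questions on
    the model's current 'frontier of knowledge'.

    Assumes at least one question has non-zero learnability."""
    rng = rng or random.Random(0)
    batch = []
    while len(batch) < batch_size:
        q = rng.choice(questions)
        if learnability(success_rate[q]) > 0:
            batch.append(q)
    return batch

# Hypothetical success rates, e.g. estimated from 8 rollouts per question.
rates = {"q_easy": 1.0, "q_hard": 0.0, "q_mid": 0.5, "q_near": 0.875}
batch = sample_learnable_batch(list(rates), rates, batch_size=4)
print(batch)  # only 'q_mid' / 'q_near' can appear; the others are rejected
```

The always-fail and always-succeed questions are filtered out before any expensive generation is spent on them, which is the source of the training-step speedups reported below.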
Evaluation Highlights
  • Achieves 3.3x speedup in training steps to reach baseline accuracy using VinePPO on GSM8K
  • Improves final test accuracy by +2.7% on MATH dataset with PPO compared to standard uniform sampling
  • Increases accuracy on the large-scale ORZ57K dataset by +1.6% using GRPO with Qwen-2.5-1.5B
Breakthrough Assessment
8/10
Provides a strong theoretical foundation for curriculum learning in LLM RL and demonstrates significant efficiency gains (up to 3.3x speedup) across multiple standard algorithms and datasets.