Prompt Curriculum Learning for Efficient LLM Post-Training

Zhaolin Gao, Joongwon Kim, Wen Sun, Thorsten Joachims, Sid Wang, R. Pang, Liang Tan
Meta Superintelligence Labs, Cornell University, University of Washington
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning (RL) · Curriculum Learning · Data Selection
PCL improves RL post-training efficiency by using an online value model to filter for intermediate-difficulty prompts, avoiding expensive rollout-based estimation while maximizing gradient signal.
Core Problem
Post-training LLMs via RL is computationally expensive because learning from prompts that are too easy (always correct) or too hard (always incorrect) yields zero gradient signal, wasting compute.
Why it matters:
  • Standard RL approaches (like GRPO) uniformly sample prompts, wasting significant compute on uninformative examples where the model has no learning signal
  • Existing filtering methods either require costly online rollouts to estimate difficulty (slowing training) or rely on off-policy difficulty estimates from historical records, which grow stale and inaccurate as the current policy improves
  • Hyperparameters like batch size heavily impact convergence speed but are often selected heuristically rather than systematically optimized
Concrete Example: If a model is trained on a math problem it already solves 100% of the time, the advantage is zero and the model learns nothing. Similarly, if it fails 100% of the time, the gradient is zero. PCL targets problems with ~50% success rate.
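The zero-signal failure mode follows directly from GRPO's group-normalized advantage. A minimal sketch (not the authors' code) showing why uniform outcomes within a rollout group yield no gradient:

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style group-normalized advantages: (r - mean) / std over a
    group of rollouts for one prompt. If every rollout gets the same
    reward, the advantage (and hence the policy gradient) is zero."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

print(grpo_advantages([1, 1, 1, 1]))  # always correct -> [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages([0, 0, 0, 0]))  # always wrong   -> [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages([1, 0, 1, 0]))  # ~50% success   -> [1.0, -1.0, 1.0, -1.0]
```

Only the mixed-outcome group produces nonzero advantages, which is why PCL targets prompts near a 50% success rate.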
Key Novelty
Prompt Curriculum Learning (PCL)
  • Replace expensive generative rollouts with a lightweight value model that predicts prompt difficulty (expected reward) in a single forward pass
  • Dynamically filter a large pool of prompts to select only those where the predicted success rate is near a target threshold (e.g., 0.5), maximizing gradient information
  • Update the value model online using the rewards from the policy's own generations, keeping the difficulty curriculum synchronized with the model's improving capabilities
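The three steps above can be sketched as follows. This is an illustrative toy, not the paper's implementation: `predict_success` stands in for the learned value model's single forward pass, and the dictionary-based online update is a simple running-average stand-in for the paper's value-model training.

```python
def select_prompts(pool, predict_success, batch_size, target=0.5):
    """Score every prompt in the pool with one cheap value-model call,
    then keep the prompts whose predicted success rate is closest to
    the target (e.g. 0.5), where gradient signal is maximal."""
    scored = sorted(pool, key=lambda p: abs(predict_success(p) - target))
    return scored[:batch_size]

def update_value_model(params, batch, observed_rewards, lr=0.1):
    """Online update: nudge each prompt's predicted success rate toward
    the mean reward the current policy just achieved on it, keeping the
    curriculum synchronized with the improving policy."""
    for prompt, reward in zip(batch, observed_rewards):
        old = params.get(prompt, 0.5)
        params[prompt] = old + lr * (reward - old)

# Toy usage: a pool of three prompts with predicted success rates.
params = {"easy": 0.95, "hard": 0.05, "mid": 0.55}
batch = select_prompts(list(params), lambda p: params.get(p, 0.5), batch_size=1)
print(batch)  # -> ['mid']: the intermediate-difficulty prompt is chosen
```

Because selection needs only forward passes rather than generative rollouts, difficulty estimation stays off the critical path of training, which is where the reported 12.1x and 16.9x identification speedups come from.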
Evaluation Highlights
  • Achieves 12.1x and 16.9x speedup in prompt difficulty identification on MATH and DeepScaleR respectively compared to rollout-based filtering
  • Outperforms GRPO baseline by +1.8% accuracy on MATH500 using Qwen3-8B-Base (88.2% vs 86.4%)
  • Reduces time-to-convergence on DeepScaleR benchmarks with Qwen3-4B-Base by ~28% (32.8h vs 45.5h) while achieving higher average accuracy
Breakthrough Assessment
8/10
Provides a systematic analysis of batch size trade-offs and offers a practical, efficient solution to the 'cold start/vanishing gradient' problem in RL for reasoning. The reliance on an online value model for filtering is a clean, effective engineering insight.