Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

Quy-Anh Dang, Chris Ngo
arXiv.org (2025)
Reasoning · RL · Benchmark

📝 Paper Summary

Small LLM · Fine-tuning · Mathematical Reasoning · Reinforcement Learning (RL)
Reinforcement learning with Group Relative Policy Optimization (GRPO) can significantly enhance mathematical reasoning in small 1.5B-parameter models using minimal compute and curated data, achieving performance comparable to larger proprietary models.
Core Problem
High-performance reasoning usually requires massive models and extensive computational resources, putting advanced reasoning capabilities out of reach for researchers with limited hardware.
Why it matters:
  • Small LLMs (1-10B) are resource-efficient but typically lack deep reasoning capabilities without expensive large-scale fine-tuning
  • Current methods rely on millions of samples or massive clusters, creating a barrier to entry that hinders the democratization of advanced AI
  • Training small models with RL often leads to optimization instability and length collapse without careful constraint management
Concrete Example: When trained on the full 'open-s1' dataset without specific length controls, the 1.5B model's output length fluctuates wildly (dropping then exploding), leading to unreadable mixed-language content and performance degradation after 200 steps.
Key Novelty
Resource-Constrained RL for Small Reasoning Models (Open-RS)
  • Adapts the GRPO algorithm to work on limited hardware (4x A40 GPUs) by eliminating the critic model and using group-based baselines to reduce memory overhead
  • Stabilizes training by mixing problem difficulties (combining hard and easy math problems) and employing a cosine-based length reward to penalize verbosity without stifling reasoning
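The two ingredients above can be sketched minimally: the group-relative baseline replaces a learned critic with the sampled group's own reward statistics, and a cosine schedule trades reward against output length. The endpoint constants and `max_len` below are illustrative assumptions, not the paper's exact values.

```python
import math
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative baseline: normalize each sampled completion's
    reward by its group's mean and std, so no critic model is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a uniform group
    return [(r - mu) / sigma for r in rewards]

def cosine_length_reward(is_correct, length, max_len=3584):
    """Cosine-interpolated length reward (endpoint values assumed):
    correct answers earn more when concise; incorrect answers are
    penalized less when long, so exploration is not cut short."""
    start, end = (2.0, 1.0) if is_correct else (-10.0, 0.0)
    t = min(length, max_len) / max_len  # fraction of the length budget
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))
```

For a group of four samples with rewards `[1, 0, 1, 0]`, the advantages come out to `[1, -1, 1, -1]`: correct completions are pushed up relative to their own group, with no value network held in memory.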
Evaluation Highlights
  • Achieves 46.7% accuracy on AIME24 with Open-RS3 (1.5B), surpassing OpenAI's o1-preview (44.6%) and DeepScaleR-1.5B-Preview (43.1%)
  • +17.0 percentage points on AMC23 (63% to 80%) using Open-RS2 compared to the base DeepSeek-R1-Distill-Qwen-1.5B model
  • Extremely cost-efficient: Training completes in <24 hours on 4 NVIDIA A40 GPUs for ~$42, compared to thousands of dollars for baseline models like DeepScaleR
Breakthrough Assessment
8/10
Demonstrates that highly capable reasoning models can be trained on consumer-accessible hardware for <$50, significantly lowering the barrier to entry for high-end AI research.