Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

Quy-Anh Dang, Chris Ngo
arXiv.org (2025)
Reasoning · RL · Benchmark

📝 Paper Summary

Small LLM · Fine-tuning · Mathematical Reasoning · Reinforcement Learning (RL)
Reinforcement learning with Group Relative Policy Optimization (GRPO) can significantly enhance mathematical reasoning in small 1.5B-parameter models using minimal compute and curated data, achieving performance comparable to larger proprietary models.
Core Problem
High-performance reasoning usually requires massive models and extensive computational resources, putting advanced reasoning capabilities out of reach for researchers with limited hardware.
Why it matters:
  • Small LLMs (1-10B) are resource-efficient but typically lack deep reasoning capabilities without expensive large-scale fine-tuning
  • Current methods rely on millions of samples or massive clusters, creating a barrier to entry that hinders the democratization of advanced AI
  • Training small models with RL often leads to optimization instability and length collapse without careful constraint management
Concrete Example: When trained on the full 'open-s1' dataset without specific length controls, the 1.5B model's output length fluctuates wildly (dropping then exploding), leading to unreadable mixed-language content and performance degradation after 200 steps.
Key Novelty
Resource-Constrained RL for Small Reasoning Models (Open-RS)
  • Adapts the GRPO algorithm to work on limited hardware (4x A40 GPUs) by eliminating the critic model and using group-based baselines to reduce memory overhead
  • Stabilizes training by mixing problem difficulties (combining hard and easy math problems) and employing a cosine-based length reward to penalize verbosity without stifling reasoning
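The two ingredients above can be sketched minimally: the group-relative baseline replaces a learned critic with the sampled group's own reward statistics, and a cosine schedule trades reward against output length. The endpoint constants and `max_len` below are illustrative assumptions, not the paper's exact values.

```python
import math
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative baseline: normalize each sampled completion's
    reward by its group's mean and std, so no critic model is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a uniform group
    return [(r - mu) / sigma for r in rewards]

def cosine_length_reward(is_correct, length, max_len=3584):
    """Cosine-interpolated length reward (endpoint values assumed):
    correct answers earn more when concise; incorrect answers are
    penalized less when long, so exploration is not cut short."""
    start, end = (2.0, 1.0) if is_correct else (-10.0, 0.0)
    t = min(length, max_len) / max_len  # fraction of the length budget
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))
```

For a group of four samples with rewards `[1, 0, 1, 0]`, the advantages come out to `[1, -1, 1, -1]`: correct completions are pushed up relative to their own group, with no value network held in memory.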
Evaluation Highlights
  • Achieves 46.7% accuracy on AIME24 with Open-RS3 (1.5B), surpassing OpenAI's o1-preview (44.6%) and DeepScaleR-1.5B-Preview (43.1%)
  • +17.0 percentage points on AMC23 (63% to 80%) using Open-RS2 compared to the base DeepSeek-R1-Distill-Qwen-1.5B model
  • Extremely cost-efficient: Training completes in <24 hours on 4 NVIDIA A40 GPUs for ~$42, compared to thousands of dollars for baseline models like DeepScaleR
Breakthrough Assessment
8/10
Demonstrates that highly capable reasoning models can be trained on consumer-accessible hardware for <$50, significantly lowering the barrier to entry for high-end AI research.