PPO: Proximal Policy Optimization—an RL algorithm that stabilizes training by clipping its policy-update objective, keeping each update close to the previous policy
LIM: Learning Impact Measurement—the authors' proposed method for scoring training samples based on how well their individual reward curves align with the model's global learning curve
SFT: Supervised Fine-Tuning—training on labeled examples (input-output pairs) using standard cross-entropy loss
RL: Reinforcement Learning—training models by rewarding correct outputs rather than just mimicking target text
alignment score: A calculated value measuring the correlation between a specific sample's reward trajectory and the model's average reward trajectory
OpenRLHF: An open-source framework for high-performance RLHF training
vLLM: A high-throughput and memory-efficient inference engine for LLMs
MATH500: A 500-problem subset of the MATH benchmark used for evaluation
AIME24: American Invitational Mathematics Examination 2024—a challenging math competition benchmark
AMC23: American Mathematics Competitions 2023—a math competition benchmark
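The alignment score defined above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses Pearson correlation as the trajectory-similarity measure and a hypothetical keep-threshold of 0.5; the authors' exact normalization may differ.

```python
from math import sqrt

def alignment_scores(sample_rewards):
    """Score each training sample by how closely its reward trajectory
    tracks the model's average (global) learning curve.

    sample_rewards: list of equal-length reward trajectories, one per
    sample, recorded at successive training checkpoints.
    """
    n_steps = len(sample_rewards[0])
    # Global learning curve: mean reward across all samples at each checkpoint.
    global_curve = [
        sum(traj[t] for traj in sample_rewards) / len(sample_rewards)
        for t in range(n_steps)
    ]

    def pearson(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        na = sqrt(sum((x - ma) ** 2 for x in a))
        nb = sqrt(sum((y - mb) ** 2 for y in b))
        if na == 0 or nb == 0:
            return 0.0  # flat trajectory carries no learning signal
        return cov / (na * nb)

    return [pearson(traj, global_curve) for traj in sample_rewards]

rewards = [
    [0.1, 0.3, 0.6, 0.8],  # improves with training: aligned
    [0.5, 0.5, 0.5, 0.5],  # flat: contributes no signal
    [0.8, 0.6, 0.4, 0.2],  # degrades: anti-aligned
]
scores = alignment_scores(rewards)
keep = [i for i, s in enumerate(scores) if s > 0.5]  # retain aligned samples
```

Samples whose scores fall below the threshold would be dropped from the RL training set, which is the selection step the LIM method performs.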