
Token-Efficient RL for LLM Reasoning

Alan Lee, Harry Tong
arXiv (2025)
RL Reasoning · Benchmark

📝 Paper Summary

Tags: Reinforcement Learning for LLMs · Parameter-Efficient Fine-Tuning (PEFT) · Mathematical Reasoning
S-GRPO and T-SPMO improve LLM reasoning using sparse, memory-efficient reinforcement-learning updates that are compatible with LoRA, outperforming full-token methods, which fail under low-compute constraints.
Core Problem
Standard RL methods like PPO and GRPO are computationally expensive (PPO requires a separate critic network, and both update on every token of full trajectories) and often fail to improve performance when restricted to low-rank adapters (LoRA) due to optimization difficulties.
Why it matters:
  • Researchers with limited GPU budgets cannot afford full-model fine-tuning or the memory overhead of critic networks required by standard PPO
  • Full-token optimization methods (like standard GRPO) can act as poor regularizers in low-parameter regimes (LoRA), failing to improve over base models
  • Existing methods lack fine-grained credit assignment, treating all tokens in a trajectory as equally responsible for the final reward
Concrete Example: When fine-tuning Qwen2-1.5B on math problems using LoRA, the standard full-token GRPO algorithm fails to improve accuracy over the base model, whereas the proposed sparse update methods significantly boost performance.
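The failure mode above can be made concrete with a minimal sketch of the full-token GRPO update being criticized: each rollout's reward is normalized against the mean and standard deviation of its group, and that single scalar advantage then weights every generated token equally. Function names here are illustrative, not from the paper's code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each rollout's reward
    against the mean/std of its group of G rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def full_token_weights(advantages, seq_lens):
    """Full-token GRPO applies one scalar advantage to EVERY
    token of a trajectory: no per-token credit assignment."""
    return [np.full(n, a) for a, n in zip(advantages, seq_lens)]

# Example: 4 rollouts of one prompt with binary correctness rewards.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
weights = full_token_weights(adv, seq_lens=[5, 7, 6, 4])
```

Because every token in a rollout receives the same weight, correct and incorrect reasoning steps inside one trajectory are reinforced identically, which is the coarse credit assignment the sparse methods below address.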
Key Novelty
Sparse Token-Level RL Optimization (S-GRPO & T-SPMO)
  • S-GRPO (Stochastic-GRPO) updates the policy using only a sampled subset of tokens (30-50%) from the output, prioritizing early tokens while stochastically dropping later ones to save memory
  • T-SPMO (Token-Specific Prefix Matching Optimization) builds prefix tries to assign credit to specific token transitions rather than whole sequences, updating fewer than 5% of tokens
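The two sparse-update schemes above can be sketched as follows. The keep-probability schedule for S-GRPO and the trie layout for T-SPMO are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sgrpo_keep_mask(seq_len, p_start=0.8, p_end=0.1, rng=None):
    """S-GRPO-style sparse mask (illustrative schedule): keep
    early tokens with high probability, linearly decaying for
    later positions, so only a subset of tokens contributes
    to the policy-gradient update."""
    rng = rng or np.random.default_rng(0)
    keep_prob = np.linspace(p_start, p_end, seq_len)
    return rng.random(seq_len) < keep_prob

def build_prefix_trie(sequences):
    """T-SPMO-style prefix trie (sketch): count how often each
    token transition occurs across rollouts, so credit can be
    assigned to specific transitions rather than whole sequences."""
    trie = {"count": 0, "children": {}}
    for seq in sequences:
        node = trie
        node["count"] += 1
        for tok in seq:
            node = node["children"].setdefault(
                tok, {"count": 0, "children": {}})
            node["count"] += 1
    return trie

# Example: mask a 12-token rollout; trie over 3 token sequences
# that share the prefix [1, 2] then diverge.
mask = sgrpo_keep_mask(seq_len=12)
trie = build_prefix_trie([[1, 2, 3], [1, 2, 4], [1, 5, 6]])
```

Nodes where the trie branches (here, after token 1 and after the prefix `[1, 2]`) mark the divergent transitions that a T-SPMO-style method would single out for updates, which is how it can touch fewer than 5% of tokens.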
Evaluation Highlights
  • Raises SVAMP accuracy from 46% (base model) to over 70% using S-GRPO and T-SPMO on Qwen2-1.5B
  • Achieves these gains while updating on only 30–50% of generated tokens for S-GRPO and under 5% for T-SPMO
  • Demonstrates that full-token GRPO with LoRA fails to improve over the base model, highlighting the necessity of sparse updates in low-rank settings
Breakthrough Assessment
7/10
Offers a practical solution for RL on consumer hardware by showing that sparse updates are not just efficient but necessary for LoRA stability. Strong empirical gains on SVAMP.