REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
arXiv (2025)
Reasoning RL Benchmark

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) Reasoning Benchmarks
Reasoning Gym is a library of procedurally generated reasoning environments for reinforcement learning with verifiable rewards: it supplies unlimited training data with automatic verification and adjustable difficulty, overcoming the limitations of fixed datasets.
Core Problem
Current reasoning research is bottlenecked by fixed-size datasets that are expensive to curate, prone to memorization, and lack the reliable verification mechanisms needed for reinforcement learning.
Why it matters:
  • RLVR relies on high-quality outcome-based feedback, but static datasets are scarce and quickly exhausted by powerful models.
  • Fixed benchmarks allow models to memorize answers rather than learn generalizable reasoning strategies.
  • Scraped internet data is unreliable and unsustainable for scaling reasoning capabilities.
Concrete Example: A model trained on a static math dataset might memorize that 'x^2 - 4 = 0' implies 'x = 2 or x = -2'. Reasoning Gym instead generates unlimited variations like '3y^2 - 27 = 0' with different variable names and constants, forcing the model to learn the underlying algebraic procedure rather than the specific instance.
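The idea above can be sketched as a tiny procedural task generator. This is a toy illustration of the concept, not Reasoning Gym's actual API: the function names and parameters here are our own, but the pattern matches what the paper describes, since each generated instance carries its own ground-truth answer and can be verified automatically.

```python
import random

def generate_quadratic(seed=None, coeff_range=(1, 9), root_range=(1, 9)):
    """Procedurally generate a solvable instance of a*v^2 - a*r^2 = 0.

    Each instance varies the coefficient, the roots, and even the
    variable name, and returns the ground-truth root set alongside
    the question so the answer is automatically verifiable.
    """
    rng = random.Random(seed)
    a = rng.randint(*coeff_range)   # leading coefficient
    r = rng.randint(*root_range)    # magnitude of the roots
    var = rng.choice("xyz")         # vary the surface form too
    question = f"Solve {a}{var}^2 - {a * r * r} = 0 for {var}."
    answer = sorted([-r, r])        # ground truth: {-r, +r}
    return question, answer

def verify(proposed, answer):
    """Binary verifiable reward: 1.0 iff the root set matches exactly."""
    return 1.0 if sorted(proposed) == answer else 0.0
```

Because generation is seeded, instances are reproducible, yet an unseeded stream yields effectively unlimited fresh problems that cannot be memorized.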
Key Novelty
Procedurally Generated Reasoning Environments for RLVR
  • Instead of static Q&A pairs, tasks are defined as algorithms that generate unlimited unique instances with automatic verification.
  • Parameters allow fine-grained control over difficulty (e.g., polynomial degree, graph size) and style (e.g., variable names), enabling precise curriculum learning.
  • Provides unambiguous, verifiable rewards for every generated instance, eliminating the need for human labeling or unstable LLM-as-a-judge evaluation.
Evaluation Highlights
  • Reasoning-optimized models like o3-mini (63.5%) significantly outperform general-purpose models like Llama 4 Maverick (41.5%) across Reasoning Gym tasks.
  • Algorithmic training transfers broadly: models trained on algorithmic tasks improve by +29.1% on held-out algebra tasks and +22.3% on geometry tasks.
  • RLVR training on Reasoning Gym improves performance on external benchmarks, yielding +9.7% on MATH and +7.7% on Big-Bench Hard using Qwen2.5-3B-Instruct.
Breakthrough Assessment
9/10
Addresses the critical 'data wall' in reasoning research by replacing static datasets with infinite procedural environments. The strong transfer results to external benchmarks validate that synthetic RLVR training builds real, generalizable skills.