RLVR: Reinforcement Learning with Verifiable Rewards—using objective, programmatic feedback (correct/incorrect) to train models, typically for math or code, rather than human preference labels.
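A verifiable reward of this kind can be sketched in a few lines. This is an illustrative example, not any particular system's implementation: the answer-extraction heuristic (take the last whitespace-separated token) is a deliberately simple stand-in for whatever parser a real pipeline would use.

```python
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the ground truth,
    else 0.0 -- an objective, programmatic signal with no human labeling.
    """
    # Simplistic extraction: treat the last token as the candidate answer.
    answer = model_output.strip().split()[-1].rstrip(".")
    return 1.0 if answer == ground_truth else 0.0

print(verifiable_reward("The total is 42.", "42"))  # -> 1.0
print(verifiable_reward("The total is 41.", "42"))  # -> 0.0
```

Because the check is binary and automatic, it scales to millions of training examples, which is what makes RLVR practical for math and code domains.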
Procedural Generation: Algorithmic creation of data where content is generated automatically based on parameters rather than manually authored.
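A minimal sketch of the idea, assuming a toy arithmetic domain: a generator function takes parameters (operand range, random seed) and emits a problem together with its known-correct answer, so the data never needs manual authoring and the ground truth comes for free.

```python
import random

def gen_addition_problem(max_operand, seed=None):
    """Procedurally generate one addition problem from parameters.

    Returns (question, answer); the answer is known by construction,
    so the same machinery can also serve as a verifiable reward source.
    """
    rng = random.Random(seed)  # seeding makes generation reproducible
    a = rng.randint(1, max_operand)
    b = rng.randint(1, max_operand)
    return f"What is {a} + {b}?", str(a + b)

question, answer = gen_addition_problem(max_operand=10, seed=0)
```

Varying the parameters (here, `max_operand`) is also the natural hook for curriculum learning: harder parameter settings yield harder problems.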
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes a policy based on the relative performance of a group of outputs for the same input.
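The core of GRPO's "relative performance of a group" idea is that each sampled output's advantage is its reward normalized against the other outputs for the same input, in place of a learned value baseline. A minimal sketch of that normalization step (not a full trainer):

```python
def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of outputs sampled
    from the same prompt: (r_i - mean(r)) / (std(r) + eps).

    Outputs that beat the group average get positive advantage; outputs
    below it get negative advantage, with no separate value network.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    eps = 1e-8  # avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Four samples for one prompt: two correct (reward 1), two incorrect (reward 0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary verifiable rewards, as above, the correct samples receive advantage near +1 and the incorrect ones near -1, which is then used to weight the policy-gradient update.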
Zero-shot: Evaluating a model on a task without providing any examples (in-context demonstrations) of that task in the prompt.
Curriculum Learning: Training strategy where the difficulty of tasks increases progressively as the model improves, rather than random sampling.
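One simple way such a schedule can be realized is a threshold rule on recent success rate; the thresholds below are illustrative assumptions, not values from any particular system.

```python
def next_difficulty(level, recent_accuracy,
                    promote_at=0.8, demote_at=0.3, max_level=10):
    """Adjust the difficulty level from the model's recent success rate.

    Promote when the model has mostly mastered the current level,
    demote when it is failing badly, otherwise stay put -- so difficulty
    tracks model capability instead of being sampled at random.
    """
    if recent_accuracy >= promote_at and level < max_level:
        return level + 1
    if recent_accuracy <= demote_at and level > 0:
        return level - 1
    return level

print(next_difficulty(3, 0.9))  # -> 4 (mastered: move up)
print(next_difficulty(3, 0.2))  # -> 2 (struggling: move down)
print(next_difficulty(3, 0.5))  # -> 3 (learning: stay)
```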
ARC: Abstraction and Reasoning Corpus—a benchmark of grid-based visual logic puzzles that test abstract pattern inference, often challenging for text-only models.
Outcome-based feedback: Reward signals based solely on whether the final answer is correct, without evaluating the intermediate reasoning steps.
GSM8K: A benchmark of grade-school math word problems.
MMLU-Pro: A massive multitask benchmark covering diverse academic and professional subjects.