RLVR: Reinforcement Learning with Verifiable Rewards—training models using objective success signals (like passing a unit test) rather than imitating human text
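The core of a verifiable reward can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function names (`rlvr_reward`, `sort_test`) are hypothetical:

```python
def rlvr_reward(candidate_solution, verifier) -> float:
    # Objective, binary reward: 1.0 iff the verifier (e.g., a unit test) passes.
    # Hypothetical sketch of the RLVR signal; no human preference labels involved.
    try:
        return 1.0 if verifier(candidate_solution) else 0.0
    except Exception:
        return 0.0  # crashing or malformed solutions earn no reward

# Example "unit test" verifier: checks a sorting function on a fixed case.
def sort_test(fn):
    return fn([3, 1, 2]) == [1, 2, 3]

print(rlvr_reward(sorted, sort_test))         # 1.0: passes the test
print(rlvr_reward(lambda xs: xs, sort_test))  # 0.0: identity does not sort
```

The key property is that the reward is computed by code, so it is objective and cheap to evaluate at scale, unlike imitation or preference signals.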
ReSyn: The proposed pipeline that autonomously generates diverse reasoning environments (generators + verifiers) using LLMs
BBH: Big-Bench Hard—a benchmark suite of challenging reasoning tasks where language models previously struggled
BBEH: Big-Bench Extra Hard—a harder version of BBH designed to test reasoning capabilities at a higher difficulty level
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—an RL algorithm used here to train the policy model on verifier rewards
Observation Function: A function O(s) that converts internal problem parameters (e.g., a grid array) into a natural language question
Verifier: A code-based function V(a) that checks if a candidate answer 'a' satisfies the constraints of the problem instance
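The observation function and verifier work as a pair over the same internal problem state. The following toy environment (a summation task; all names are hypothetical) sketches that pairing under the definitions above:

```python
import random

def make_instance(rng):
    # Internal problem parameters s: hidden state the model never sees directly.
    nums = [rng.randint(1, 9) for _ in range(5)]
    return {"nums": nums, "answer": sum(nums)}

def observe(s):
    # Observation function O(s): renders internal state as a natural-language question.
    return f"What is the sum of the numbers {s['nums']}?"

def verify(s, a):
    # Verifier V(a): code-based check that candidate answer a satisfies the instance.
    return a == s["answer"]

rng = random.Random(0)
s = make_instance(rng)
print(observe(s))                  # the question posed to the model
print(verify(s, sum(s["nums"])))   # True: correct candidate answer
print(verify(s, -1))               # False: incorrect candidate answer
```

Because both O(s) and V(a) are derived from the same state s, the verifier can score any candidate answer without a reference solution being written by hand.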
Generator-Verifier Gap: The concept that verifying a solution is often computationally easier than finding it (e.g., checking a sorted list vs. sorting it), enabling supervision for hard problems
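The sorting example in the entry above makes the gap concrete: checking order takes a single linear pass, while producing the sorted list costs O(n log n). A minimal sketch:

```python
def is_sorted(xs):
    # Verification: one O(n) pass over adjacent pairs.
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

xs = [5, 2, 9, 1]
solution = sorted(xs)        # finding the solution: O(n log n)
print(is_sorted(xs))         # False: the raw input is unordered
print(is_sorted(solution))   # True: cheap check of an expensive-to-find answer
```

Because the check is easier than the search, a verifier can supervise a model on problems whose solutions the verifier's author never computed.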