RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using binary rewards based on whether the final answer can be automatically checked as correct
Exploration bottleneck: A situation in RL where the agent rarely or never discovers a high-reward action (correct solution), preventing it from learning
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of outputs for the same input to reduce variance
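The group-relative normalization at the heart of GRPO can be sketched in a few lines: each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. This is a minimal illustration of the normalization step only (the epsilon value and use of population standard deviation are implementation choices, not fixed by the definition):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of rollouts for the same prompt.

    Advantage = (reward - group mean) / (group std + eps).
    With binary verifiable rewards, correct rollouts get positive
    advantages and incorrect ones get negative advantages.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 rollouts for one prompt, binary rewards
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that if every rollout in the group gets the same reward (all correct or all incorrect), all advantages are zero—the learning signal vanishes, which is one way the exploration bottleneck above manifests in practice.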
SFT: Supervised Fine-Tuning—training a model on labeled examples of inputs and desired outputs
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
Rollout: A single execution of the current policy (model) to generate a full response from a prompt
Pass@k: A metric measuring the probability that at least one of k generated samples is correct
Curriculum Learning: A training strategy where the model learns from easy examples before progressing to harder ones
Distillation: Training a smaller 'student' model to mimic the behavior or outputs of a larger 'teacher' model
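The classic soft-target distillation objective (Hinton et al., 2015) matches the student's temperature-softened output distribution to the teacher's via KL divergence. A minimal sketch over raw logits (the temperature value here is illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: the soft-target
    term of the distillation objective. Zero when the student's
    distribution exactly matches the teacher's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In LLM distillation on outputs only (as in "teacher generates CoT traces, student does SFT on them"), this reduces to ordinary cross-entropy on the teacher's sampled text rather than its full logit distribution.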
Sparse rewards: When the agent receives non-zero rewards very infrequently, making it difficult to determine which actions led to success