RLVR: Reinforcement Learning with Verifiable Rewards—training models using outcomes (like passing unit tests) as ground-truth rewards rather than a learned reward model
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards across a group of sampled outputs for the same prompt to reduce variance, removing the need for a separate critic (value) model
SFT: Supervised Fine-Tuning—training a model on high-quality input-output pairs before RL
Entropy Expansion: A training phase designed to increase the randomness and diversity of the model's outputs to prevent it from getting stuck in repetitive failure patterns
Rollout: A single attempt by the model to generate a solution during RL training; '8 rollouts' means the model generates 8 different solutions for one problem to estimate gradients
Pass@k: A metric measuring the probability that at least one of k sampled solutions to a problem is correct
Curriculum Learning: Training on easier tasks first or organizing training data by difficulty to help the model learn progressively
MoE: Mixture-of-Experts—a model architecture where only a subset of sub-models (experts) is activated for each input, allowing large total parameter capacity with lower inference cost
Arena Learning: An iterative data selection method where a model is trained on subsets of data to identify and retain 'hard' samples that it consistently gets wrong
OJ: Online Judge—a system that automatically tests submitted code against hidden test cases (e.g., LeetCode, Codeforces)
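The group-relative normalization described under GRPO can be sketched in a few lines. The 8-rollout group and binary pass/fail rewards below are illustrative assumptions, not a fixed part of the algorithm:

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against its own group.

    GRPO replaces a learned critic with a group baseline: all rollouts
    for one prompt are scored, and each advantage is the reward minus
    the group mean, divided by the group standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # every rollout scored identically: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# 8 rollouts for one prompt, verifiable binary reward (tests pass or not)
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0]))
# passing rollouts get positive advantage, failing ones negative
```

Note that advantages within a group always sum to zero, which is why a prompt where all rollouts pass (or all fail) contributes no gradient.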
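A concrete way to see what entropy expansion targets is the Shannon entropy of the model's next-token distribution. The sketch below uses a plain softmax with a temperature knob as a simple stand-in for more exploratory sampling (an illustrative assumption, not the training-time mechanism): a flatter distribution carries higher entropy, i.e. more output diversity.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the
    distribution over next tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; 0 for a deterministic distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]
low = entropy(softmax(logits, temperature=1.0))
high = entropy(softmax(logits, temperature=2.0))
print(low < high)  # True: flatter distribution, higher entropy
```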
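Pass@k is usually estimated by drawing n samples per problem (n ≥ k) and applying the standard unbiased combinatorial estimator; a minimal sketch, with the 200-sample pool chosen purely for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, c of them correct.

    P(at least one of k draws without replacement is correct)
        = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:  # fewer than k incorrect samples: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 50 of which pass the hidden tests
print(pass_at_k(200, 50, 1))   # 0.25: equals the raw pass rate at k=1
print(pass_at_k(200, 50, 10))  # much higher with 10 attempts
```

Computing it this way, rather than literally picking k of the n samples, avoids the high variance of a single k-sized draw.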