Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems

Zongqian Li, Tengchao Lv, Shaohan Huang, Yixuan Su, Qinzheng Sun, Qiufeng Yin, Ying Xin, Scarlett Li, Lei Cui, Nigel Collier, Furu Wei
Microsoft Research, University of Cambridge
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Code Generation · Data Curation for LLMs · Reinforcement Learning for Code
MicroCoder improves code generation models by using an LLM-based 'predict-calibrate-select' framework to filter out simplistic problems and retain only fresh, difficult competitive programming challenges for training.
Core Problem
Existing coding datasets suffer from difficulty imbalance (dominated by simple problems), lack of recency (leading to data leakage), inconsistent formats, and poor data quality (noise/missing test cases).
Why it matters:
  • Training on easy problems fails to drive model improvement on complex algorithmic tasks where capabilities are most stretched
  • Models often produce algorithmically correct solutions in incorrect formats (e.g., function completion vs. standard I/O) due to inconsistent training data
  • Stale benchmarks allow models to memorize solutions from pre-training rather than learning to generalize to unseen problems
Concrete Example: Web-collected problems often contain incomplete descriptions or excessive test cases (hundreds per problem) that stall training. Additionally, a mix of LeetCode-style (function completion) and OJ-style (standard I/O) problems without clear formatting instructions causes models to fail execution despite correct logic.
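The format mismatch above can be made concrete with a toy problem presented both ways. This is an illustrative sketch, not code from the paper; the function and problem are hypothetical.

```python
import sys

# Same toy task ("sum two integers") in the two formats the summary contrasts.

# LeetCode-style function completion: a test harness imports and calls this.
def add_two(a: int, b: int) -> int:
    return a + b

# OJ-style standard I/O: the program itself reads stdin and prints to stdout.
# A model that emits this form when the judge expects a function (or vice
# versa) fails execution even though the arithmetic is correct.
def oj_main() -> None:
    a, b = map(int, sys.stdin.readline().split())
    print(a + b)

if __name__ == "__main__":
    print(add_two(2, 3))  # → 5
```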
Key Novelty
Predict-Calibrate-Select Difficulty Filtering
  • Uses an LLM to score problem difficulty across five dimensions (e.g., Algorithmic Thinking, Implementation) instead of relying on platform tags
  • Calibrates these predicted scores against empirical pass rates from a 'thinking' model to establish ground-truth difficulty boundaries
  • Systematically removes simplistic problems (scoring < 2.5) to create a difficulty-dense dataset that maximizes learning efficiency
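The selection step above can be sketched as a threshold filter over per-problem difficulty scores. This is a minimal illustration under assumptions: the five dimension names beyond "Algorithmic Thinking" and "Implementation" are placeholders, and the scores are mocked where the paper's pipeline would use an LLM judge calibrated against pass rates.

```python
from statistics import mean

# Hypothetical per-problem difficulty scores (1-5) across five dimensions;
# in the paper an LLM-as-judge assigns these, then the boundary is calibrated
# against empirical pass rates of a 'thinking' model.
problems = [
    {"id": "p1", "scores": {"algorithmic_thinking": 1, "implementation": 2,
                            "dim3": 2, "dim4": 1, "dim5": 2}},
    {"id": "p2", "scores": {"algorithmic_thinking": 4, "implementation": 3,
                            "dim3": 4, "dim4": 3, "dim5": 4}},
]

DIFFICULTY_THRESHOLD = 2.5  # problems averaging below this are dropped

def select_difficult(problems, threshold=DIFFICULTY_THRESHOLD):
    """Keep only problems whose mean dimension score meets the threshold."""
    return [p for p in problems if mean(p["scores"].values()) >= threshold]

kept = select_difficult(problems)
print([p["id"] for p in kept])  # → ['p2']
```

Here p1 averages 1.6 and is filtered out, while p2 averages 3.6 and is retained, mirroring how the framework removes simplistic problems to densify difficulty.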
Evaluation Highlights
  • Achieves up to 17.2% relative gains in overall performance on medium and hard problems compared to baselines
  • Delivers 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size
  • Reduces the ratio of easy problems in the training set from approximately 40% to under 20% via the filtering framework
Breakthrough Assessment
7/10
Presents a strong systematic framework for data difficulty scaling that yields significant efficiency gains (3x faster convergence). While the techniques (LLM-as-judge, filtering) are known, the specific application to difficulty calibration for code RL is impactful.