OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

📝 Paper Summary

Data Distillation, Code Generation, Reasoning Models

OpenCodeReasoning is a large-scale dataset of 736k reasoning-augmented coding solutions used to fine-tune standard LLMs, allowing them to outperform similarly sized models trained with reinforcement learning.

Core Problem

While reasoning models like DeepSeek-R1 excel at coding through reinforcement learning, the data and methods to distill these capabilities into smaller, more efficient models are often proprietary or small-scale.

Why it matters:

High-quality human-labeled coding data is scarce and expensive, creating a bottleneck for improving non-reasoning models.
Current open-weight reasoning models rely on complex RL training pipelines, while the potential of pure Supervised Fine-Tuning (SFT) with large-scale reasoning data remains under-explored.
Existing reasoning datasets for code are small (17k-114k samples), limiting the performance gains achievable by distilled student models.

Concrete Example: A standard instruction-tuned model often fails hard competitive programming problems because it starts generating code without planning. R1-style models instead generate long 'thinking' traces before coding, but smaller models have not yet reproduced this behavior at scale.

Key Novelty

OpenCodeReasoning (Large-Scale Reasoning Distillation)

Constructs the largest reasoning-based coding dataset (736k samples) by filtering high-difficulty problems and generating solutions with DeepSeek-R1.
Demonstrates that pure SFT on extensive reasoning traces allows standard models (Qwen2.5) to surpass the same-size R1-Distill-Qwen models without any RL stage of their own.
Validates a counter-intuitive filtering strategy: prioritizing instruction diversity and problem hardness over perfect solution correctness (finding that models learn well even from incorrect solutions).

Architecture

Comparison of pass@1 scores on LiveCodeBench for OCR-Qwen models versus R1-Distill-Qwen models across 7B, 14B, and 32B sizes.

Evaluation Highlights

OCR-Qwen-32B achieves 61.8 pass@1 on LiveCodeBench, surpassing OpenAI's O1 and O3-Mini and narrowing the gap with the teacher model DeepSeek-R1 (65.6).
OCR-Qwen-14B-Instruct attains 59.4 pass@1 on LiveCodeBench, outperforming the R1-Distill-Qwen-14B baseline (51.3) by 8.1 absolute points.
OCR-Qwen-7B-Instruct scores 51.3 pass@1 on LiveCodeBench, outperforming the R1-Distill-Qwen-7B baseline (38.0) by 13.3 absolute points.

Breakthrough Assessment

9/10

Sets a new state of the art among open-weight coding models of comparable size (7B to 32B) using SFT alone, suggesting that complex RL pipelines are not required when the distillation data is large enough. OCR-Qwen-32B also outperforms OpenAI's O1 and O3-Mini on LiveCodeBench.

⚙️ Technical Details

Problem Definition

Setting: Competitive programming code generation with reasoning traces

Inputs: Programming problem statement (natural language description)

Outputs: A reasoning trace (chain-of-thought) followed by the solution code

Pipeline Flow

Question Collection & Deduplication
Teacher Generation (DeepSeek-R1)
Filtering & Refinement
Student Fine-Tuning
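
The first stage can be illustrated with a minimal sketch. It assumes exact-match deduplication on whitespace- and case-normalized problem statements. The normalization rule, the record layout, and the helper names normalize and deduplicate are illustrative assumptions, not details taken from the paper.

```python
import hashlib

def normalize(statement: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization rule)."""
    return " ".join(statement.split()).lower()

def deduplicate(questions: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized problem statement."""
    seen: set[str] = set()
    unique = []
    for q in questions:
        key = hashlib.sha256(normalize(q["statement"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique

# Hypothetical records; only the source names come from this summary.
questions = [
    {"source": "TACO", "statement": "Given N integers, print their sum."},
    {"source": "APPS", "statement": "given  N integers,\nprint their sum."},
    {"source": "CodeForces", "statement": "Find the longest palindromic substring."},
]
unique = deduplicate(questions)
print(len(unique), [q["source"] for q in unique])  # 2 ['TACO', 'CodeForces']
```

Hashing the normalized text keeps memory use flat when statements are long; near-duplicate detection would need a fuzzier key than the one assumed here.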

System Modules

Question Collector (Data Construction)

Aggregates problems from TACO, APPS, CodeContests, and CodeForces

Teacher Generator (Data Construction)

Generates reasoning traces and code solutions for the collected questions

Model or implementation: DeepSeek-R1

Refiner (Data Construction)

Parses and filters generated responses
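
A rough sketch of what such a refinement step might look like follows. It assumes the teacher wraps its reasoning in <think> tags and puts the solution in a fenced python block, and it uses Python's built-in ast module as a stand-in for the Tree Sitter syntax check mentioned under Key Terms. The function name refine and the "keep the last code block" rule are hypothetical.

```python
import ast
import re

FENCE = "`" * 3  # built programmatically so the example nests cleanly in a fenced block
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def refine(response: str):
    """Return (reasoning, code) if the response is well-formed, else None."""
    think = THINK_RE.search(response)
    if think is None:
        return None
    # Only look for code after the reasoning trace ends.
    blocks = CODE_RE.findall(response[think.end():])
    if not blocks:
        return None
    code = blocks[-1]  # assumed rule: keep the last code block
    try:
        ast.parse(code)  # syntax check only; the code is never executed
    except SyntaxError:
        return None
    return think.group(1).strip(), code

good = ("<think>Sum the inputs.</think>\n" + FENCE
        + "python\nprint(sum(map(int, input().split())))\n" + FENCE)
bad = "<think>Oops.</think>\n" + FENCE + "python\nprint(sum(\n" + FENCE
print(refine(good) is not None, refine(bad) is None)  # True True
```

Because the check is purely syntactic, a solution that parses but produces wrong answers still passes the filter.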

Student Model

Learns to mimic the reasoning and coding behavior of the teacher via SFT

Model or implementation: Qwen2.5 (7B, 14B, 32B Base & Instruct)

Novel Architectural Elements

Data scaling strategy specifically for reasoning traces: moving from small-scale (17k) to large-scale (736k) reasoning data for code.
Prioritization of the 'Hard' problem subset from CodeContests during data scaling, to address R1's failure modes.

Modeling

Base Model: Qwen2.5 (7B, 14B, 32B) - both Base and Instruct variants

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

736,712 Python samples (OpenCodeReasoning)
Source breakdown: 359k CodeForces, 237k CodeContests, 114k TACO, 24k APPS
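
As a quick arithmetic check, the rounded per-source figures can be summed and compared with the reported total. The snippet below uses only the numbers listed above. The small remainder may reflect rounding or sources not listed here; this summary does not say which.

```python
total_samples = 736_712
breakdown_k = {"CodeForces": 359, "CodeContests": 237, "TACO": 114, "APPS": 24}

listed = sum(breakdown_k.values()) * 1_000
print(listed)                  # 734000
print(total_samples - listed)  # 2712

# Approximate share of each source, computed from the rounded figures.
for name, k in breakdown_k.items():
    print(f"{name}: {100 * k * 1_000 / total_samples:.1f}%")
```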

Key Hyperparameters:

learning_rate: 5e-5
batch_size: 256
epochs: 3
optimizer: AdamW
max_sequence_length: 32,768
warmup_ratio: 0.1
scheduler: CosineAnnealing
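
To make the schedule concrete, the sketch below derives step counts from the listed hyperparameters and evaluates a warmup-plus-cosine learning rate. Several points are assumptions rather than reported details: one optimizer step per 256 samples with no sequence packing, a linear warmup shape, and decay to zero. The function name lr_at is hypothetical.

```python
import math

NUM_SAMPLES = 736_712
BATCH_SIZE = 256
EPOCHS = 3
PEAK_LR = 5e-5
WARMUP_RATIO = 0.1

steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_RATIO * total_steps)

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (both assumed)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)  # 2878 8634 863
```

Under these assumptions the learning rate peaks after roughly 863 steps and spends the remaining 90% of training decaying.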

Compute: Trained on NVIDIA H100-80GB GPUs (exact count/hours not reported)

Comparison to Prior Work

vs. R1-Distill-Qwen: Uses a significantly larger and more diverse dataset (736k samples; the size of the R1-Distill training data is not specified), achieving higher accuracy without RL.
vs. Bespoke-Stratos/OpenThinker: Demonstrates that while small data induces reasoning format, large-scale data (700k+) is needed for state-of-the-art accuracy.
vs. OlympicCoder: Incorporates explicit reasoning traces (<think> tags) rather than just standard instruction-response pairs.

Limitations

Execution-based filtering (keeping only solutions that pass tests) hurt performance, so the pipeline relies on syntax-only filtering and the training set includes incorrect solutions.
Including C++ data did not improve Python performance (no cross-lingual transfer observed).
Longer inference token budgets (32k) did not improve performance on Hard problems compared to 16k.
Models occasionally enter unrecoverable reasoning loops despite large context windows.

Reproducibility

The dataset (OpenCodeReasoning) is to be fully open-sourced. Data sourcing and filtering methods are described in detail, and training hyperparameters are provided. No code repository URL is given in the text, although the dataset release is promised.

📊 Experiments & Results

Evaluation Setup

Competitive programming generation evaluated on unseen problems

Benchmarks:

LiveCodeBench (competitive programming; problems dated 2408 to 2502, i.e. August 2024 to February 2025)
CodeContests (Competitive Programming)

Metrics:

pass@1 (average of 64 runs for LCB, 16 for CodeContests)
Statistical methodology: Not explicitly reported in the paper
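
The averaged pass@1 metric can be sketched as follows. This is a generic illustration rather than the paper's evaluation harness: each problem's score is the fraction of its sampled solutions that pass all tests, and the benchmark score is the mean over problems. The function name pass_at_1 and the toy data are hypothetical.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """results[i][j] is True if sample j for problem i passed all tests."""
    per_problem = [sum(r) / len(r) for r in results]
    return 100.0 * sum(per_problem) / len(per_problem)

# Toy data: 3 problems with 4 samples each
# (the summary reports 64 samples for LCB and 16 for CodeContests).
results = [
    [True, True, True, True],
    [True, False, False, True],
    [False, False, False, False],
]
print(pass_at_1(results))  # 50.0
```

Averaging over many samples reduces the variance that a single sampled run would show, which matters when models are decoded with sampling.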

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on LiveCodeBench (LCB): OCR-Qwen models consistently outperform R1-Distill-Qwen baselines of equivalent size (rows in order: 7B, 14B, 32B).
LiveCodeBench	pass@1	38.0	51.3	+13.3
LiveCodeBench	pass@1	51.3	59.4	+8.1
LiveCodeBench	pass@1	58.1	61.8	+3.7
Performance on CodeContests showing similar dominance of OCR-Qwen models.
CodeContests	pass@1	10.6	18.1	+7.5
CodeContests	pass@1	18.3	24.6	+6.3
Ablation study on data correctness filtering, showing counter-intuitive results.
LiveCodeBench	pass@1	46.1	47.7	+1.6

Experiment Figures

Scaling law curve showing LiveCodeBench Pass@1 performance as dataset size increases from 25k to 736k samples.

Average output token length (reasoning trace length) across Easy, Medium, and Hard problems for different models.

Main Takeaways

Data Scale Matters: Unlike math reasoning where small data suffices, competitive coding requires massive scaling (736k samples) to reach SOTA.
Incorrect Data is Useful: Fine-tuning on incorrect solutions for hard problems yields better models than limiting to correct solutions for easy problems, suggesting the reasoning process is more valuable than final code correctness.
Efficiency: OCR-32B models achieve comparable performance to QwQ-32B while using 20-30% fewer tokens during reasoning.
Reasoning Patterns: Correct solutions exhibit more 'self-evaluation' and 'subgoal' patterns than incorrect ones. Both correct and incorrect traces show increased backtracking on harder problems.

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) Reasoning
Knowledge Distillation
Reinforcement Learning (RL) for LLMs

Key Terms

pass@1: A metric measuring the percentage of problems where the model's first generated solution passes all unit tests.

DeepSeek-R1: A state-of-the-art open-weights reasoning model that uses reinforcement learning to generate long chain-of-thought traces before answering.

SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset of inputs and outputs.

Chain-of-Thought: A prompting/training technique where the model generates intermediate reasoning steps before the final answer.

Tree Sitter: A parser generator and incremental parsing library that builds syntax trees for source code; used here to verify the syntactic correctness of generated solutions.

Nucleus Sampling: A text decoding method (top-p) that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p.
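
A minimal sketch of the idea over a toy next-token distribution follows. The token names and probabilities are invented for illustration, and this version stops once the cumulative mass reaches p (some implementations use a strict comparison instead).

```python
import random

def nucleus_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest top-probability set whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}

def sample(probs: dict[str, float], p: float, rng: random.Random) -> str:
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]

probs = {"for": 0.5, "while": 0.3, "if": 0.15, "goto": 0.05}
nucleus = nucleus_filter(probs, p=0.9)
print(sorted(nucleus))  # ['for', 'if', 'while']
```

With p=0.9 the low-probability tail token is never sampled, while the three remaining tokens keep their relative odds.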

SGLang: A structured generation language and engine used for efficient LLM inference.

IOI: International Olympiad in Informatics—a prestigious competitive programming contest, used here as a benchmark for C++ performance.