SFT: Supervised Fine-Tuning—training a model on labeled examples ((problem, solution) pairs) to learn the desired behavior
RL: Reinforcement Learning—training a model to maximize a reward signal, here based on passing test cases
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance
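The group normalization in GRPO can be sketched as follows: rewards for several sampled outputs of the same prompt are standardized against the group's own mean and standard deviation, so no learned value baseline is needed. This is a minimal illustration of the advantage computation only, not a full training loop.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Rewards here are scalar (e.g., fraction of test cases passed).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std,
    reducing variance across outputs for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled solutions for one problem; two pass, two fail.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Outputs with above-average reward get positive advantage and are reinforced; below-average outputs are penalized, all relative to siblings from the same prompt.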
AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code, used here to check for syntax errors
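For Python code, the AST-based syntax check amounts to attempting a parse and rejecting candidates that fail. A minimal sketch (the function name is illustrative):

```python
# Sketch of an AST-based syntax filter: parse the candidate source
# and discard it if the parser raises a SyntaxError.
import ast

def is_syntactically_valid(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

This catches malformed generations cheaply before any test execution, but says nothing about runtime correctness.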
Dual-verification: A strategy proposed in this paper that cross-checks synthetic solutions against synthetic test cases to filter out incorrect data
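The dual-verification idea can be sketched as executing each synthetic solution against its synthetic test cases and keeping only pairs that agree. The function below is a hypothetical illustration of this filtering step, not the paper's actual implementation.

```python
# Hedged sketch of dual-verification filtering: a (solution, tests)
# pair survives only if the solution passes every synthetic test.
# Names and the (args, expected) test format are illustrative.

def dual_verify(solution_fn, test_cases) -> bool:
    """Return True if solution_fn passes all (args, expected) pairs."""
    for args, expected in test_cases:
        try:
            if solution_fn(*args) != expected:
                return False
        except Exception:
            return False  # crashes also disqualify the pair
    return True

# Example: a synthetic add() checked against its synthetic tests.
keep = dual_verify(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)])
```

Because both the solutions and the tests are model-generated, requiring them to agree filters out cases where either side is wrong, at the cost of occasionally keeping pairs that are consistently wrong together.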
Pass@k: A metric estimating the probability that at least one of the top k generated solutions is correct
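Pass@k is commonly computed with the unbiased estimator introduced with Codex: given n samples per problem of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch for a single problem:

```python
# Unbiased pass@k estimator for one problem:
# n = total samples, c = number of correct samples.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct -> pass@1 = 0.3
p = pass_at_k(10, 3, 1)
```

Sampling n > k solutions and averaging this estimator gives lower variance than literally drawing k samples per problem.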
LiveCodeBench: A benchmark for code generation that focuses on recent competitive programming problems to avoid data contamination
TACO: A large-scale dataset of competitive programming problems used here as a source for feature extraction