Evaluation Setup
Train on APPS+; evaluate on the APPS+ test set, and zero-shot on MBPP and HumanEval.
Benchmarks:
- APPS+ (program synthesis, Python) [new]
- MBPP (introductory Python programming)
- HumanEval (Python coding problems)
Metrics:
- Pass@1
- Pass@k
- Statistical methodology: Not explicitly reported in the paper
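Pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021): generate n samples per problem, count the c that pass, and compute 1 − C(n−c, k) / C(n, k). A minimal sketch (the function name is ours, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples, c of which
    pass all unit tests: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to the empirical pass rate c / n.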
Key Results
Performance on the proposed APPS+ dataset, showing StepCoder outperforms baselines across all difficulty levels:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| APPS+ (Overall) | Pass@1 | 31.7 | 36.1 | +4.4 |
| APPS+ (Competition) | Pass@1 | 6.4 | 8.6 | +2.2 |
| APPS+ (Overall) | Pass@1 | 29.8 | 36.1 | +6.3 |

Zero-shot generalization to other benchmarks (MBPP and HumanEval) after training on APPS+:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HumanEval | Pass@1 | 78.0 | 78.7 | +0.7 |
| MBPP | Pass@1 | 65.2 | 67.0 | +1.8 |

Ablation study on the APPS+ validation set demonstrating the contribution of the CCCS and FGO components:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| APPS+ (Overall) | Pass@1 | 34.6 | 36.1 | +1.5 |
| APPS+ (Overall) | Pass@1 | 35.5 | 36.1 | +0.6 |
Main Takeaways
- RL-based methods consistently outperform SFT and base models on code generation tasks.
- CCCS is particularly effective on "Competition"-level problems, suggesting that curriculum learning helps the policy explore complex logic paths.
- FGO reduces the noise in policy updates, leading to better optimization compared to vanilla PPO.
- SFT alone on a single dataset (APPS+) can degrade generalization to other benchmarks (MBPP/HumanEval), while RL tends to improve or maintain generalization.
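The FGO takeaway above can be illustrated with a toy sketch of loss masking: average the per-token losses only over tokens whose code was actually executed by the unit tests, so unexecuted segments contribute no gradient noise. This is our simplified illustration, not the paper's implementation (which applies the mask inside the PPO objective over real token losses):

```python
def masked_mean_loss(token_losses: list[float], executed: list[bool]) -> float:
    """Mean of per-token losses restricted to tokens marked as executed.

    token_losses: one loss value per generated token.
    executed: True where the corresponding code was covered by the unit
    tests; unexecuted tokens are masked out of the average entirely.
    """
    kept = [loss for loss, ran in zip(token_losses, executed) if ran]
    return sum(kept) / len(kept) if kept else 0.0
```

With all tokens executed this degenerates to the ordinary mean loss, so masking only changes the update when coverage is partial.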