Evaluation Setup
Zero-shot and few-shot evaluation on code generation, completion, and reasoning tasks
Benchmarks:
- HumanEval (Python function generation from docstrings)
- MBPP (Python programming problems)
- DS-1000 (data science workflows across 7 libraries)
- LeetCode Contest (hard competitive programming problems; newly introduced benchmark)
- CrossCodeEval (Cross-file code completion)
Metrics:
- Pass@1
- Exact Match (EM)
- Edit Similarity (ES)
- Statistical methodology: Not explicitly reported in the paper
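Pass@k scores like those above are conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass all tests, and estimate the chance that at least one of k drawn samples passes. A minimal sketch (the function name and the sample counts in the example are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples passing all unit tests. Returns the probability that
    at least one of k samples drawn without replacement passes."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 200 samples, 30 correct -> pass@1 estimate
print(round(pass_at_k(200, 30, 1), 3))  # 0.15
```

For k=1 this reduces to the empirical pass rate c/n; the combinatorial form matters for larger k, where naively averaging per-sample success would bias the estimate.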
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HumanEval | Pass@1 | 48.2 | 56.1 | +7.9 |
| MBPP | Pass@1 | 55.2 | 66.0 | +10.8 |
| HumanEval (instruction-tuned) | Pass@1 | 76.2 | 79.3 | +3.1 |
| CrossCodeEval (Python) | Exact Match | 7.32 | 9.53 | +2.21 |
| LeetCode Contest | Pass@1 | 9.4 | 27.8 | +18.4 |

- DeepSeek-Coder outperforms comparable open-source models on standard Python generation benchmarks (HumanEval, MBPP).
- Instruction tuning yields performance surpassing GPT-3.5 on HumanEval.
- Cross-file completion results demonstrate the efficacy of repository-level pre-training (CrossCodeEval).
- On hard, unseen competitive programming problems (LeetCode Contest), DeepSeek-Coder dominates open-source baselines.
Main Takeaways
- Repo-level pre-training (Topological Sort) significantly improves cross-file code completion capabilities compared to file-level training.
- A 50% PSM (Prefix-Suffix-Middle) rate in FIM training balances infilling capability and left-to-right generation better than 100% FIM.
- The 6.7B model is highly efficient, often outperforming the much larger CodeLlama-34B on multiple benchmarks like MBPP and HumanEval.
- Chain-of-Thought (CoT) prompting further enhances performance on complex reasoning tasks like LeetCode Hard problems.
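The repo-level pre-training idea in the first takeaway can be sketched with the standard-library topological sorter: parse each file's intra-repo imports, then concatenate files so dependencies precede their dependents. The dependency graph below is a hypothetical example, and the paper's exact tie-breaking and cycle handling are not shown here (graphlib raises on cycles, which real repositories can contain):

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

def order_repo_files(deps: dict[str, set[str]]) -> list[str]:
    """Order files so each file's intra-repo dependencies appear before it,
    giving the model in-context definitions for cross-file references.
    deps maps a file to the set of repo files it imports."""
    return list(TopologicalSorter(deps).static_order())

# Hypothetical repo: main.py imports utils.py and models.py;
# models.py imports utils.py.
deps = {
    "main.py": {"utils.py", "models.py"},
    "models.py": {"utils.py"},
    "utils.py": set(),
}
print(order_repo_files(deps))  # utils.py first, main.py last
```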
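The PSM takeaway can likewise be sketched: with some probability (the FIM rate), a training document is split at two random points and rearranged so the model sees prefix and suffix before predicting the middle. The sentinel token strings and helper name below are illustrative placeholders, not the tokenizer's actual special tokens:

```python
import random

# Placeholder sentinels; real FIM tokens are tokenizer-specific.
FIM_BEGIN, FIM_HOLE, FIM_END = "<|fim_begin|>", "<|fim_hole|>", "<|fim_end|>"

def make_training_sample(doc: str, fim_rate: float = 0.5, rng=random) -> str:
    """With probability fim_rate, rearrange doc into Prefix-Suffix-Middle
    (PSM) order for infilling training; otherwise keep it as an ordinary
    left-to-right next-token-prediction sample."""
    if rng.random() >= fim_rate:
        return doc
    # Split at two random character positions: prefix | middle | suffix.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM: the model conditions on prefix and suffix, then emits the middle.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"
```

A 50% `fim_rate` means half the corpus still trains plain left-to-right generation, which is the balance the takeaway refers to; 100% FIM would sacrifice ordinary completion quality for infilling.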