
DeepSeek-Coder-V2: Breaking the barrier of closed-source models in code intelligence

DeepSeek-AI
arXiv, June 2024
Pretraining · RL · Reasoning · Benchmark

📝 Paper Summary

Code Generation · Mathematical Reasoning · Large Language Models (LLMs) · Mixture-of-Experts (MoE)
DeepSeek-Coder-V2 is an open-source Mixture-of-Experts (MoE) code model that achieves performance comparable to GPT-4 Turbo by continuing pre-training on a massive 6 trillion token corpus of code, math, and natural language.
Core Problem
Open-source code models have improved steadily but still lag significantly behind state-of-the-art closed-source models such as GPT-4 Turbo and Claude 3 Opus on coding and mathematical-reasoning tasks.
Why it matters:
  • Closed-source dominance limits accessibility and research transparency in high-performance code intelligence
  • Prior open-source models lacked the scale and data diversity to bridge the gap with top-tier proprietary models
  • Existing models often support limited programming languages (e.g., ~86) and shorter context windows (e.g., 16K)
Concrete Example: While models like StarCoder2 handle mainstream languages well, they may fail on less common languages or on complex math problems where closed models like GPT-4 excel. DeepSeek-Coder-V2 expands language support from 86 to 338 languages and matches GPT-4-level performance on benchmarks such as HumanEval and MATH.
Key Novelty
Large-Scale MoE Code Model with Multi-Source Pre-training
  • Leverages a Mixture-of-Experts (MoE) architecture to scale up parameters (236B total) while keeping inference costs low (21B active), enabling efficient large-scale performance
  • Continues pre-training from a general LLM checkpoint using a massive 6 trillion token dataset specifically curated for code (60%), math (10%), and natural language (30%)
  • Significantly expands programming language support to 338 languages and context length to 128K tokens
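The efficiency claim above rests on sparse expert routing: each token activates only a few experts, so the active parameter count per forward pass is a small fraction of the total. The following is a minimal toy sketch of top-k MoE routing in NumPy; the shapes, gating function, and expert counts are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def moe_forward(x, experts_w, gate_w, top_k=2):
    """Toy top-k Mixture-of-Experts layer.

    Only top_k of the experts run per token, so the 'active' parameters
    are a small fraction of the total -- the idea behind the
    236B-total / 21B-active split. All details here are illustrative.
    """
    logits = x @ gate_w                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                     # softmax over selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts_w[e])      # weighted sum of expert outputs
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 3
x = rng.standard_normal((tokens, d))
experts = rng.standard_normal((n_experts, d, d))     # one weight matrix per expert
gate = rng.standard_normal((d, n_experts))
y = moe_forward(x, experts, gate, top_k=2)
print(y.shape)  # each token touched only 2 of the 4 experts
```

With top_k=2 of 4 experts, each token's compute uses half the expert parameters; at DeepSeek-Coder-V2's scale the same principle yields roughly 21B active out of 236B total.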
Evaluation Highlights
  • Achieves 90.2% on HumanEval and 76.2% on MBPP, outperforming all open-source models and matching GPT-4 Turbo
  • Attains 75.7% accuracy on the MATH benchmark, rivaling GPT-4o (76.6%) and surpassing Claude 3 Opus
  • First open-source model to score above 10% on SWE-Bench (the paper reports surpassing this threshold without stating an exact score here)
Breakthrough Assessment
9/10
First open-source code model to credibly claim parity with GPT-4 Turbo across coding and math benchmarks, using an efficient MoE architecture and massive data scale.