
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

DeepSeek-AI, Peking University
arXiv, January 2024
Pretraining · Benchmark · Reasoning

📝 Paper Summary

Code Large Language Models · Repository-Level Code Understanding
DeepSeek-Coder is a series of open-source code models trained from scratch on 2 trillion tokens, combining repository-level data construction with a 16K context window to capture cross-file dependencies.
Core Problem
Existing open-source code models often lag behind closed-source counterparts and struggle with project-level contexts because they are typically trained on individual files, ignoring cross-file dependencies.
Why it matters:
  • Real-world software development requires understanding dependencies across multiple files, not just isolated snippets
  • Closed-source models (like GPT-4) restrict research access and commercial application due to proprietary nature
  • Standard training objectives (next-token prediction) on single files fail to capture the structural relationships inherent in complex software repositories
Concrete Example: In a project where `file A` defines a utility function used in `file B`, a model trained only on isolated files might hallucinate the function's signature when generating code for `file B`. DeepSeek-Coder parses the repository dependency graph to place `file A`'s content before `file B` in the context window, ensuring accurate invocation.
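The dependency-ordering idea above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's actual pipeline; the file contents and the `slugify` helper are hypothetical, and only the core step (topologically sorting files so definitions precede usages in the training context) follows the paper's description.

```python
from graphlib import TopologicalSorter

# Hypothetical two-file repo: file_b.py imports a helper defined in file_a.py.
files = {
    "file_a.py": "def slugify(title):\n    return title.lower().replace(' ', '-')\n",
    "file_b.py": "from file_a import slugify\n\nprint(slugify('Hello World'))\n",
}

# Dependency graph: each file maps to the set of files it imports.
deps = {"file_a.py": set(), "file_b.py": {"file_a.py"}}

# A topological order guarantees definitions appear before their usages.
order = list(TopologicalSorter(deps).static_order())

# Concatenate files in that order into one training context, prefixing
# each with a path comment so the model sees file boundaries.
context = "\n".join(f"# {path}\n{files[path]}" for path in order)
print(order)  # ['file_a.py', 'file_b.py']
```

With this ordering, `file_a.py`'s definition of `slugify` always lands before `file_b.py`'s call site inside the context window, which is exactly what prevents the signature hallucination described above.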
Key Novelty
Repository-Level Pre-training with Fill-In-Middle (FIM)
  • Constructs training data by topologically sorting files based on dependency graphs (e.g., imports/includes) so the model sees definitions before usages within the same context window
  • Combines Next-Token Prediction with a Fill-In-Middle (FIM) objective at the document level to enhance code infilling capabilities efficiently
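The FIM objective can be sketched as a data transformation: cut a document into prefix, middle, and suffix, then rearrange it so ordinary next-token prediction learns to infill. A minimal sketch follows; the sentinel strings are illustrative placeholders, not DeepSeek-Coder's actual special tokens, and the paper applies this at the document level with a tuned FIM rate.

```python
import random

# Placeholder sentinels; real FIM tokenizers use dedicated special tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Split a document at two random cut points into (prefix, middle,
    suffix) and emit it in prefix-suffix-middle (PSM) order, so that
    predicting the final segment teaches the model to fill in the middle."""
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

rng = random.Random(0)
doc = "def add(x, y):\n    return x + y\n"
print(to_fim_example(doc, rng))
```

At inference time the same format lets the model complete a cursor position in an editor: the code before the cursor becomes the prefix, the code after it the suffix, and the model generates the middle.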
Evaluation Highlights
  • DeepSeek-Coder-Base 33B achieves 56.1% Pass@1 on HumanEval, outperforming CodeLlama-34B (48.2%) and StarCoder-16B (31.7%)
  • DeepSeek-Coder-Instruct 33B reaches 79.3% on HumanEval, surpassing GPT-3.5-Turbo (76.2%) and narrowing the gap with GPT-4
  • On the LeetCode Contest benchmark (hard, unseen problems), the Instruct 33B model achieves 27.8% Pass@1, far ahead of CodeLlama-34B-Instruct (9.4%)
Breakthrough Assessment
9/10
Sets a new state-of-the-art for open-source code models, outperforming major competitors like CodeLlama and StarCoder. The repository-level data construction is a significant methodological improvement for practical coding tasks.