
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

X Bi, D Chen, G Chen, S Chen, D Dai, C Deng, H Ding…
DeepSeek-AI
arXiv, January 2024
Pretraining · RL · Reasoning · Benchmark

📝 Paper Summary

Large Language Models · Scaling Laws · Open-Source Foundation Models
DeepSeek LLM revisits scaling laws to derive optimal hyperparameter and model/data allocation strategies, resulting in a 67B parameter model that outperforms LLaMA-2 70B.
Core Problem
Prior scaling law studies (e.g., Chinchilla, Kaplan et al.) offer conflicting conclusions on optimal model/data allocation and often lack precise guidance on hyperparameter scaling for large budgets.
Why it matters:
  • Inaccurate scaling laws lead to inefficient compute usage during the costly pre-training of large language models.
  • Existing open-source models often scale up without a rigorous theoretical basis for their specific data quality and compute budget.
  • The community lacks transparency on how data quality specifically alters the optimal ratio between model size and training tokens.
Concrete Example: Previous laws prescribed fixed token-per-parameter ratios (e.g., roughly 20:1 for Chinchilla), but DeepSeek finds that with higher-quality data the optimal allocation shifts significantly toward larger models (a higher model-scaling exponent and a lower data-scaling exponent), meaning Chinchilla-optimal allocations may under-train models on high-quality corpora.
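The allocation shift can be sketched numerically. This is an illustrative toy, not the paper's fitted law: it only assumes the compute-optimal split follows power laws (model scale M ∝ C^a, tokens D ≈ C / M), with a larger exponent `a` standing in for higher-quality data; the exponent values below are made up for exposition.

```python
# Toy sketch of compute-optimal allocation under power-law scaling.
# Assumption of this sketch: C ≈ M * D, where M is model scale
# (non-embedding FLOPs/token) and D is training tokens, and the
# optimal M grows as C^a. Exponents here are illustrative, not fitted.

def optimal_allocation(compute_budget, a, coeff_m=1.0):
    """Split a FLOP budget C into model scale M = coeff_m * C^a and tokens D = C / M."""
    model_scale = coeff_m * compute_budget ** a
    tokens = compute_budget / model_scale
    return model_scale, tokens

# Same budget, two hypothetical data qualities:
C = 1e21
m_lo, d_lo = optimal_allocation(C, a=0.50)  # lower-quality data: balanced split
m_hi, d_hi = optimal_allocation(C, a=0.55)  # higher-quality data: tilt toward model size

# Higher-quality data => larger model trained on fewer tokens at the same budget.
assert m_hi > m_lo and d_hi < d_lo
```

The point of the sketch is only the direction of the shift: raising the model exponent at a fixed budget necessarily trades tokens for model scale, which is the paper's qualitative finding about data quality.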
Key Novelty
Scaling Laws with Non-Embedding FLOPs and Data Quality Awareness
  • Proposes using 'non-embedding FLOPs/token' instead of parameter count to represent model scale, correcting for attention overhead and vocabulary parameter discrepancies.
  • Demonstrates that optimal model/data allocation is not static; higher quality data dictates allocating more compute to model size rather than data quantity (larger model, fewer tokens).
  • Derives empirical formulae for optimal batch size and learning rate as power-law functions of the compute budget.
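The two ingredients above (a FLOPs-based model-scale measure and power-law hyperparameter fits) can be sketched in a few lines. The FLOPs formula follows the common 6-FLOPs-per-parameter accounting for the transformer body plus a quadratic attention term; the learning-rate and batch-size coefficients below are placeholders with the qualitative shape described (learning rate decays, batch size grows with compute), not the paper's fitted values.

```python
# Sketch of 'non-embedding FLOPs/token' as a model-scale measure, plus
# power-law hyperparameter schedules. Constants are assumptions of this
# sketch: 72*n*d^2 ≈ 6 FLOPs per non-embedding parameter for the dense
# blocks, 12*n*d*L for attention scores; embedding/vocab params excluded.

def non_embedding_flops_per_token(n_layers, d_model, seq_len):
    dense = 72 * n_layers * d_model ** 2           # dense (non-embedding) compute
    attention = 12 * n_layers * d_model * seq_len  # attention overhead, grows with context
    return dense + attention

def optimal_lr(compute_budget, coeff=0.3, exp=-0.125):
    """Learning rate as a decaying power law of compute (placeholder coefficients)."""
    return coeff * compute_budget ** exp

def optimal_batch_size(compute_budget, coeff=0.3, exp=0.33):
    """Batch size as a growing power law of compute (placeholder coefficients)."""
    return coeff * compute_budget ** exp

# A hypothetical 7B-ish configuration, for illustration only:
M = non_embedding_flops_per_token(n_layers=30, d_model=4096, seq_len=4096)
```

Using FLOPs/token rather than raw parameter count means two models with the same parameter budget but different depth/width/context settings are no longer treated as identical scales, which is what corrects the attention and vocabulary discrepancies noted above.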
Evaluation Highlights
  • DeepSeek LLM 67B surpasses LLaMA-2 70B on reasoning benchmarks (e.g., +12.3 points on GSM8K).
  • DeepSeek LLM 67B Chat outperforms GPT-3.5 on open-ended evaluations in both English and Chinese.
  • The 7B model achieves higher performance than LLaMA-2 7B across nearly all reported benchmarks (e.g., +8.8 points on MATH).
Breakthrough Assessment
8/10
Significant contribution to scaling law theory by incorporating data quality and non-embedding FLOPs. The resulting 67B model is a strong open-source contender, outperforming the LLaMA-2 70B baseline.