DPO: Direct Preference Optimization—an alignment method that optimizes a policy directly on human preference pairs, without training an explicit reward model or running a separate reinforcement-learning loop
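The DPO objective can be written per preference pair as a logistic loss on an implicit reward margin. A minimal sketch, assuming the inputs are summed log-probabilities of the chosen/rejected responses under the policy and a frozen reference model (the variable names and `beta=0.1` are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    # Implicit reward margin: beta * (chosen log-ratio minus rejected log-ratio)
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when the policy prefers
    # the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that already favors the chosen response incurs a lower loss:
low = dpo_loss(pi_chosen=-5.0, pi_rejected=-20.0, ref_chosen=-10.0, ref_rejected=-10.0)
high = dpo_loss(pi_chosen=-20.0, pi_rejected=-5.0, ref_chosen=-10.0, ref_rejected=-10.0)
assert low < high
```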
Non-embedding FLOPs/token: A metric for model scale that counts floating point operations per token excluding the embedding layer, offering a more accurate proxy for compute than parameter count
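Why this matters can be seen with rough arithmetic: a sketch below counts parameters for a small decoder-only transformer (GPT-2-small-like shapes; the helper and the simplification of ignoring biases and layer norms are assumptions), then applies the standard ~6 FLOPs per non-embedding parameter per training token approximation:

```python
def non_embedding_params(n_layers, d_model, vocab_size, d_ff=None):
    """Rough parameter count for a decoder-only transformer, split into
    non-embedding and embedding parts (biases and layer norms ignored)."""
    d_ff = d_ff or 4 * d_model
    # Per layer: attention projections (4 * d_model^2) + FFN (2 * d_model * d_ff)
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff
    non_emb = n_layers * per_layer
    emb = vocab_size * d_model  # embedding table (tied unembedding assumed)
    return non_emb, emb

def train_flops_per_token(non_emb_params):
    # Common approximation: ~6 FLOPs per parameter per token
    # (2 for the forward pass, 4 for the backward pass).
    return 6 * non_emb_params

non_emb, emb = non_embedding_params(n_layers=12, d_model=768, vocab_size=50257)
# non_emb is ~85M while emb is ~39M: for small models the embedding table
# is a large slice of total parameters, which is why raw parameter count
# overstates the compute the model actually performs per token.
```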
IsoFLOP profile: A method to find optimal model/data allocations by fixing total compute (FLOPs) and varying model size and training data size to minimize loss
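One IsoFLOP slice can be sketched numerically: under the common C ≈ 6·N·D approximation, fixing the compute budget C determines the token count D for each candidate model size N (the budget and candidate sizes below are illustrative):

```python
def tokens_for_budget(compute_flops, n_params):
    """Training tokens affordable for a model of n_params parameters at a
    fixed compute budget, under the approximation C = 6 * N * D."""
    return compute_flops / (6 * n_params)

C = 1e21  # one fixed compute budget, i.e. one IsoFLOP slice (illustrative)
candidates = [4e8, 1e9, 4e9]  # model sizes to compare at this budget
allocations = [(n, tokens_for_budget(C, n)) for n in candidates]
# Training each (model size, token count) pair costs the same compute;
# the pair reaching the lowest final loss marks the optimum for this
# budget, and repeating over several budgets traces the scaling law.
```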
SwiGLU: A gated activation function (Swish-Gated Linear Unit) used in the feed-forward networks of modern LLMs
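The gating structure is compact enough to show directly. A minimal NumPy sketch, with illustrative weight shapes (in practice d_ff is often scaled to ~8/3 · d_model so the three matrices match a standard FFN's parameter count):

```python
import numpy as np

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward block: FFN(x) = (Swish(x @ W) * (x @ V)) @ W2,
    where Swish(z) = z * sigmoid(z) (also called SiLU)."""
    gate = x @ W
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish / SiLU activation
    return (swish * (x @ V)) @ W2  # elementwise gating, then down-projection

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((2, d_model))
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, W, V, W2)
assert y.shape == (2, d_model)
```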
GQA: Grouped-Query Attention—an attention mechanism that shares key/value heads across multiple query heads to reduce memory bandwidth usage during inference
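A toy NumPy sketch of the sharing pattern (shapes and names are assumptions; a real implementation keeps only the n_kv_heads K/V caches in memory, which is where the bandwidth saving comes from):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy GQA: q has n_q_heads, while k and v have only n_kv_heads;
    each KV head serves a group of n_q_heads // n_kv_heads query heads."""
    n_q_heads, seq, d_head = q.shape
    group = n_q_heads // n_kv_heads
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # only 2 KV heads
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
```

With n_kv_heads = 1 this reduces to multi-query attention; with n_kv_heads equal to the query head count it reduces to standard multi-head attention.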
Rotary Embedding: A positional encoding method (RoPE) that rotates pairs of dimensions in the query and key vectors by position-dependent angles, so that their dot products—and hence attention scores—depend on the relative positions of tokens
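A minimal NumPy sketch of the rotation, using the split-half pairing convention (the pairing layout and default base are common choices, not the only ones):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding: rotate consecutive dimension pairs of
    each vector by angles proportional to its position. Applied to both
    queries and keys, the dot product then depends only on the offset
    between positions."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # one frequency per pair
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(2)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))
# Same relative offset (2) at different absolute positions -> same score.
a = rope(q, np.array([3]))[0] @ rope(k, np.array([5]))[0]
b = rope(q, np.array([10]))[0] @ rope(k, np.array([12]))[0]
```

The rotation is norm-preserving, so it changes attention scores without distorting vector magnitudes.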
Multi-step learning rate scheduler: A schedule where the learning rate drops by a fixed factor at specific milestones (steps), rather than decaying continuously like a cosine schedule
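The schedule is a step function of the training step. A short sketch, with illustrative milestone steps and decay factor:

```python
def multistep_lr(step, base_lr=3e-4, milestones=(8000, 9000), gamma=0.316):
    """Multi-step LR schedule: hold base_lr, then multiply by gamma at
    each milestone step (values here are illustrative, e.g. gamma close
    to 1/sqrt(10))."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr

# Piecewise-constant: flat plateaus separated by discrete drops, unlike
# the smooth continuous decay of a cosine schedule.
schedule = [multistep_lr(s) for s in (0, 7999, 8000, 8999, 9000)]
```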
BPE: Byte-Pair Encoding—a tokenization algorithm that iteratively merges frequent pairs of bytes/characters
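The merge loop fits in a few lines. A minimal sketch of BPE training on a toy word-frequency table (the corpus is invented; real tokenizers operate on bytes and handle pre-tokenization, which this omits):

```python
from collections import Counter

def bpe_train(word_freqs, n_merges):
    """Toy BPE trainer: repeatedly find the most frequent adjacent symbol
    pair across the corpus and merge it into a single symbol."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}  # words as symbol tuples
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for syms, f in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = {}
        for syms, f in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges

merges = bpe_train({"low": 5, "lower": 2, "lowest": 3}, n_merges=2)
# Frequent substrings like "low" quickly coalesce into single tokens.
```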