WSD: Warmup-Stable-Decay—a learning rate scheduler with a brief warmup, a long stable phase at a constant high learning rate, and a rapid final decay (annealing)
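The three phases can be sketched as a simple step-dependent function; the parameter names and fractions below (`peak_lr`, `warmup_frac`, `decay_frac`) are illustrative assumptions, not values from any specific training run:

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.05, decay_frac=0.1):
    """Minimal Warmup-Stable-Decay schedule sketch.

    Linear warmup to peak_lr, a long constant 'stable' phase,
    then a rapid linear decay to zero at the end of training.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                      # warmup: ramp up
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                       # stable: constant high LR
        return peak_lr
    return peak_lr * (total_steps - step) / max(decay_steps, 1)  # decay
```

Unlike cosine schedules, the decay start does not need to be fixed in advance, which is why WSD is popular for continued pre-training.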
Annealing: The final phase of training where the learning rate is decayed to zero, often accompanied by high-quality data to boost performance
SwiGLU: Swish Gated Linear Unit—an activation function that combines the Swish activation with a gating mechanism, known for better performance in LLMs
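A minimal numpy sketch of the SwiGLU feed-forward block, assuming the common formulation (Swish(xW) gated elementwise by xV, then projected by W2); the weight names are illustrative:

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish / SiLU activation: x * sigmoid(beta * x)."""
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

def swiglu(x, W, V, W2):
    """SwiGLU feed-forward sketch: (Swish(x @ W) * (x @ V)) @ W2.

    The second projection x @ V acts as a learned gate on the
    Swish-activated branch.
    """
    return (swish(x @ W) * (x @ V)) @ W2
```

In practice the gated hidden size is often reduced (e.g. by 2/3) so the block's parameter count matches an ungated FFN.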
GQA: Grouped-Query Attention—an attention mechanism that groups query heads to share key-value heads, reducing memory usage and speeding up inference
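The sharing pattern can be sketched with a small single-batch loop (illustrative only; real implementations broadcast the KV heads rather than loop):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """GQA sketch: q has shape (n_q_heads, T, d); k and v have
    shape (n_kv_heads, T, d). Each group of n_q_heads // n_kv_heads
    query heads attends to one shared key-value head, so the KV
    cache holds n_kv_heads heads instead of n_q_heads.
    """
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kh, vh = k[h // group], v[h // group]       # shared KV head
        scores = q[h] @ kh.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ vh
    return out
```

With n_kv_heads = 1 this reduces to multi-query attention; with n_kv_heads = n_q_heads it is standard multi-head attention.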
RoPE: Rotary Positional Embedding—a method for encoding positional information by rotating the query and key vectors by position-dependent angles, making attention scores depend on relative position
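A minimal sketch of the rotation, assuming the common pairing of the first and second halves of each vector and the standard base of 10000:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary embedding to x of shape (T, d), d even.

    Dimension pairs (i, i + d/2) are rotated by an angle that grows
    with position and shrinks with frequency index, so dot products
    between rotated queries and keys depend on relative offsets.
    """
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # (half,)
    angles = np.outer(np.arange(T), freqs)          # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because it is a pure rotation, RoPE preserves vector norms and is applied to queries and keys only, not to values.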
RMSNorm: Root Mean Square Layer Normalization—a normalization technique that simplifies LayerNorm by removing the mean subtraction, improving efficiency
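The simplification relative to LayerNorm is easy to see in code; a minimal sketch:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm sketch: rescale x by the inverse root-mean-square of its
    last axis, then apply a learned gain. Unlike LayerNorm there is no
    mean subtraction and no bias, saving a reduction per token.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```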
TEV: Token Embedding Variability—a metric measuring the variance of the entries within a token's embedding vector, used to detect distribution shifts during training
MFU: Model FLOPs Utilization—the ratio of the achieved floating-point operations per second to the theoretical peak performance of the hardware
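MFU is a ratio of achieved to peak throughput; a minimal sketch using the common ~6N FLOPs-per-token estimate for a forward plus backward pass over a model with N parameters (the function name and example numbers are illustrative):

```python
def model_flops_utilization(tokens_per_sec, n_params, peak_flops):
    """MFU sketch: achieved FLOPs/s divided by the hardware's
    theoretical peak. Uses the standard ~6 * N FLOPs-per-token
    approximation for training (2N forward, 4N backward).
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / peak_flops
```

For example, a hypothetical 7B-parameter model training at 3,000 tokens/s on hardware with a 312 TFLOP/s peak would land near 40% MFU, a typical range for large-scale training runs.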