
LLM360 K2: Scaling Up 360-Open-Source Large Language Models

Z Liu, B Tan, H Wang, W Neiswanger, T Tao, H Li…
Mohamed bin Zayed University of Artificial Intelligence, Petuum, Inc., Carnegie Mellon University, University of Southern California, University of Illinois Urbana-Champaign, University of California San Diego, Rutgers University
arXiv, January 2025
Pretraining · Reasoning · Benchmark

📝 Paper Summary

Open Source · Large Language Models · LLM Pretraining and Fine-tuning · Data Curation
The K2 project releases a fully reproducible 65B-parameter LLM with all training artifacts (intermediate checkpoints, training logs, and the exact data sequence seen at each step) to democratize access to large-scale AI development.
Core Problem
While many 'open' LLMs exist, the training details, exact data sequences, and intermediate states of the largest models (65B+) remain proprietary, preventing the community from studying training dynamics like loss spikes.
Why it matters:
  • Lack of transparency prevents researchers from learning how to mitigate training instabilities in large-scale models.
  • Without access to intermediate checkpoints and data, the community cannot study the longitudinal evolution of model capabilities.
  • High computational costs erect a barrier to entry, so the knowledge of how to train models at state-of-the-art scale is currently concentrated in a few large tech companies.
Concrete Example: When a large model encounters a 'loss spike' (divergence) during training, external researchers typically cannot see the logs or model state to analyze why it happened. K2 releases the exact checkpoints and logs surrounding two 'malignant' spikes it encountered, allowing the community to analyze these failures directly.
Key Novelty
360-degree Open Source Framework for 65B Scale
  • Releases not just the final weights, but 140 intermediate checkpoints, the exact data sequence used for each step, and full W&B training logs.
  • Provides a 'longitudinal capability study' showing how specific skills (math, coding) emerge and evolve throughout the training process.
  • Releases 'failed' artifacts (checkpoints from loss spikes) to foster research into training stability, rather than hiding these errors.
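The 140 released checkpoints can, for instance, be fetched from the Hugging Face Hub by revision. A minimal sketch, with the caveat that the repo id `LLM360/K2` and the `ckpt_###` branch-naming scheme are assumptions modeled on earlier LLM360 releases such as Amber, not details stated in this summary:

```python
# Sketch: fetching one of the 140 intermediate checkpoints by revision.
# ASSUMPTIONS: the Hub repo id "LLM360/K2" and the "ckpt_###" branch names
# are hypothetical, patterned on prior LLM360 releases; verify on the Hub.

def checkpoint_revision(index: int) -> str:
    """Format a zero-padded branch name for checkpoint `index` (0..139)."""
    if not 0 <= index < 140:
        raise ValueError("K2 released 140 intermediate checkpoints")
    return f"ckpt_{index:03d}"

def load_checkpoint(index: int):
    """Load one intermediate checkpoint (downloads 65B weights; needs GPUs)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy import, deferred
    rev = checkpoint_revision(index)
    tok = AutoTokenizer.from_pretrained("LLM360/K2", revision=rev)
    model = AutoModelForCausalLM.from_pretrained("LLM360/K2", revision=rev)
    return tok, model

print(checkpoint_revision(0), checkpoint_revision(139))
```

Iterating `load_checkpoint` over all revisions is what enables the longitudinal capability study described above.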
Evaluation Highlights
  • K2 Diamond outperforms LLaMA-65B and rivals Llama2-70B on GSM8K and HumanEval benchmarks despite using fewer tokens.
  • Achieves ~35% reduction in FLOPs compared to Llama2-70B while demonstrating superior mathematical reasoning and coding capabilities.
  • Surpasses Llama2-70B on medical domain benchmarks like MedQA and PubMedQA.
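The ~35% figure is consistent with the standard C ≈ 6·N·D training-compute approximation (6 FLOPs per parameter per training token), assuming K2's roughly 1.4T training tokens and Llama2-70B's roughly 2T; both token counts come from the respective papers, not from this summary:

```python
# Back-of-the-envelope check of the ~35% FLOPs reduction using the common
# C ≈ 6·N·D approximation (6 FLOPs per parameter per training token).
# Assumed token counts: K2 ~1.4T (per the K2 paper), Llama2-70B ~2T.

def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

k2     = train_flops(65e9, 1.4e12)   # ~5.46e23 FLOPs
llama2 = train_flops(70e9, 2.0e12)   # ~8.40e23 FLOPs

reduction = 1.0 - k2 / llama2
print(f"K2 uses {reduction:.0%} fewer training FLOPs than Llama2-70B")
# → K2 uses 35% fewer training FLOPs than Llama2-70B
```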
Breakthrough Assessment
9/10
While not SOTA in raw performance compared to closed models like GPT-4, the level of transparency (releasing 140 checkpoints, exact data order, and failure logs) for a 65B model is unprecedented and invaluable for the research community.