
LLM360 K2: Scaling Up 360-Open-Source Large Language Models

Z Liu, B Tan, H Wang, W Neiswanger, T Tao, H Li…
Mohamed bin Zayed University of Artificial Intelligence, Petuum, Inc., Carnegie Mellon University, University of Southern California, University of Illinois Urbana-Champaign, University of California San Diego, Rutgers University
arXiv, January 2025
Pretraining · Reasoning · Benchmark

📝 Paper Summary

Open Source · Large Language Models · LLM Pretraining and Fine-tuning · Data Curation
The K2 project releases a fully reproducible 65B-parameter LLM with all training artifacts (intermediate checkpoints, training logs, and the exact data sequence seen at each step) to democratize access to large-scale AI development.
Core Problem
While many 'open' LLMs exist, the training details, exact data sequences, and intermediate states of the largest models (65B+) remain proprietary, preventing the community from studying training dynamics like loss spikes.
Why it matters:
  • Lack of transparency prevents researchers from learning how to mitigate training instabilities in large-scale models.
  • Without access to intermediate checkpoints and data, the community cannot study the longitudinal evolution of model capabilities.
  • High computational costs erect a barrier to entry, so the knowledge of how to train models at state-of-the-art scale is currently concentrated in a few large tech companies.
Concrete Example: When a large model encounters a 'loss spike' (divergence) during training, external researchers typically cannot see the logs or model state to analyze why it happened. K2 releases the exact checkpoints and logs surrounding two 'malignant' spikes it encountered, allowing the community to analyze these failures directly.
Key Novelty
360-degree Open Source Framework for 65B Scale
  • Releases not just the final weights, but 140 intermediate checkpoints, the exact data sequence used for each step, and full W&B training logs.
  • Provides a 'longitudinal capability study' showing how specific skills (math, coding) emerge and evolve throughout the training process.
  • Releases 'failed' artifacts (checkpoints from loss spikes) to foster research into training stability, rather than hiding these errors.
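The 140 released checkpoints can, for instance, be fetched from the Hugging Face Hub by revision. A minimal sketch, with the caveat that the repo id `LLM360/K2` and the `ckpt_###` branch-naming scheme are assumptions modeled on earlier LLM360 releases such as Amber, not details stated in this summary:

```python
# Sketch: fetching one of the 140 intermediate checkpoints by revision.
# ASSUMPTIONS: the Hub repo id "LLM360/K2" and the "ckpt_###" branch names
# are hypothetical, patterned on prior LLM360 releases; verify on the Hub.

def checkpoint_revision(index: int) -> str:
    """Format a zero-padded branch name for checkpoint `index` (0..139)."""
    if not 0 <= index < 140:
        raise ValueError("K2 released 140 intermediate checkpoints")
    return f"ckpt_{index:03d}"

def load_checkpoint(index: int):
    """Load one intermediate checkpoint (downloads 65B weights; needs GPUs)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy import, deferred
    rev = checkpoint_revision(index)
    tok = AutoTokenizer.from_pretrained("LLM360/K2", revision=rev)
    model = AutoModelForCausalLM.from_pretrained("LLM360/K2", revision=rev)
    return tok, model

print(checkpoint_revision(0), checkpoint_revision(139))
```

Iterating `load_checkpoint` over all revisions is what enables the longitudinal capability study described above.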
Evaluation Highlights
  • K2 Diamond outperforms LLaMA-65B and rivals Llama2-70B on GSM8K and HumanEval benchmarks despite using fewer tokens.
  • Achieves ~35% reduction in FLOPs compared to Llama2-70B while demonstrating superior mathematical reasoning and coding capabilities.
  • Surpasses Llama2-70B on medical domain benchmarks like MedQA and PubMedQA.
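The ~35% figure is consistent with the standard C ≈ 6·N·D training-compute approximation (6 FLOPs per parameter per training token), assuming K2's roughly 1.4T training tokens and Llama2-70B's roughly 2T; both token counts come from the respective papers, not from this summary:

```python
# Back-of-the-envelope check of the ~35% FLOPs reduction using the common
# C ≈ 6·N·D approximation (6 FLOPs per parameter per training token).
# Assumed token counts: K2 ~1.4T (per the K2 paper), Llama2-70B ~2T.

def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

k2     = train_flops(65e9, 1.4e12)   # ~5.46e23 FLOPs
llama2 = train_flops(70e9, 2.0e12)   # ~8.40e23 FLOPs

reduction = 1.0 - k2 / llama2
print(f"K2 uses {reduction:.0%} fewer training FLOPs than Llama2-70B")
# → K2 uses 35% fewer training FLOPs than Llama2-70B
```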
Breakthrough Assessment
9/10
While not SOTA in raw performance compared to closed models like GPT-4, the level of transparency (releasing 140 checkpoints, exact data order, and failure logs) for a 65B model is unprecedented and invaluable for the research community.