
FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

H Kang, Z Yu, C Xiong
Carnegie Mellon University
arXiv, May 2025
Pretraining Benchmark

📝 Paper Summary

Mixture-of-Experts (MoE) · Scaling Laws · Efficient Pretraining
FLAME-MoE provides a fully open suite of seven compute-optimal Mixture-of-Experts models alongside a derived scaling law to facilitate reproducible research on sparse model training dynamics.
Core Problem
While MoE architectures are widely used in production, the research community lacks a fully open, end-to-end platform for investigating their scaling, routing behaviors, and training dynamics.
Why it matters:
  • Existing open-source MoE efforts focus on architectural design or downstream performance but offer limited support for studying training dynamics and routing evolution.
  • Without transparent logs and checkpoints, researchers cannot analyze internal behaviors like expert specialization or load balancing, impeding systematic improvements.
  • The lack of a shared experimental platform prevents rigorous cross-scale comparisons and reproducible science in the sparse model domain.
Concrete Example: Dense-model researchers benefit from the Pythia suite's transparency, but MoE researchers struggle to answer questions such as when expert specialization emerges or how routing stabilizes, because commercial models (e.g., Gemini-1.5) do not release training traces or intermediate checkpoints.
Key Novelty
FLAME-MoE: Open MoE Research Suite & Scaling Law
  • Releases a family of 7 decoder-only MoE models (38M–1.7B active parameters) trained on compute-optimal budgets derived from a dedicated scaling law study.
  • Provides full transparency by releasing not just weights, but also training data pipelines, code, logs, and intermediate checkpoints to enable deep analysis of internal dynamics.
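The released models are decoder-only transformers with sparse expert layers. As a minimal illustration of the top-k routing mechanism such layers typically use (this is a NumPy sketch for intuition, not the FLAME-MoE implementation; the expert count, k value, and FFN shape are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(tokens, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:  (n_tokens, d_model) input activations
    gate_w:  (d_model, n_experts) router weights
    experts: list of (W1, b1, W2) two-layer ReLU feed-forward experts
    """
    logits = tokens @ gate_w                    # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k highest-scoring experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        sel = topk[t]
        weights = softmax(logits[t, sel])       # renormalize gates over selected experts
        for w, e in zip(weights, sel):
            W1, b1, W2 = experts[e]
            h = np.maximum(tokens[t] @ W1 + b1, 0.0)  # ReLU hidden layer
            out[t] += w * (h @ W2)              # gate-weighted expert output
    return out
```

Because only k experts run per token, active parameters (and FLOPs) stay far below total parameters, which is the efficiency the suite's compute-optimal budgets are built around.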
Evaluation Highlights
  • FLAME-MoE outperforms dense baselines trained with identical FLOPs by up to 3.4 percentage points in average accuracy across 6 downstream tasks.
  • Scaling law analysis achieves a strong Spearman correlation of 0.89 between predicted validation loss and actual downstream performance on HellaSwag.
  • Matches or outperforms dense models trained with 2x the compute budget (e.g., FLAME-MoE-400M matches Dense-400M-2x), demonstrating superior training efficiency.
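The compute-optimal budgets above come from a scaling law fit. The paper's exact functional form is not reproduced here, but a common sketch is a power law in compute, L(C) ≈ a · C^(−b), fit by linear regression in log-log space (the data below are synthetic, for illustration only):

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute^(-b) via linear regression on log-transformed data."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

# Synthetic losses generated from a known power law, just to show the fit recovers it.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs
loss = 20.0 * compute ** -0.05                # pretend validation losses
a, b = fit_power_law(compute, loss)
predicted = a * (1e22) ** -b                  # extrapolate to a larger budget
```

A fit like this lets one predict validation loss at unseen compute budgets, which the paper then correlates with downstream performance (e.g., the reported 0.89 Spearman correlation on HellaSwag).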
Breakthrough Assessment
8/10
Significant contribution to open science by providing the 'Pythia of MoEs'. While the architecture isn't radically new, the rigorous scaling law derivation and full artifact release fill a critical infrastructure gap.