LLaMA-MoE v2 converts instructed dense LLMs into sparse MoE models by partitioning both attention heads and MLP neurons into experts, using a two-stage post-training strategy to recover performance without expensive pre-training.
Core Problem
Converting dense models to MoE typically requires resource-intensive continual pre-training and often neglects sparsity in the attention module.
Why it matters:
Standard dense models activate all parameters, limiting scaling efficiency compared to sparse models
Existing 'sparse upcycling' methods often duplicate parameters (increasing size) and require massive compute to retrain
Ignoring attention sparsity misses optimization opportunities, especially given the heterogeneity of attention head patterns
Concrete Example: Previous methods like Sparse Upcycling copy MLP layers to create experts, inflating the model size and necessitating heavy pre-training. LLaMA-MoE v2 instead partitions the existing neurons of a LLaMA-3-8B-Instruct model and recovers capabilities using only lightweight instruction tuning.
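The size difference can be made concrete with a toy parameter count (the numbers and the helper name are purely illustrative, not from the paper): upcycling duplicates the MLP once per expert, while neuron partitioning splits the existing MLP so the total stays fixed.

```python
def total_params(attn_params, mlp_params, n_experts, upcycle):
    """Toy comparison: sparse upcycling copies the MLP n_experts times,
    whereas neuron partitioning reuses the existing MLP weights."""
    if upcycle:
        return attn_params + mlp_params * n_experts  # model grows
    return attn_params + mlp_params                  # model size unchanged

# Illustrative units: 100 attention params, 50 MLP params, 8 experts.
print(total_params(100, 50, 8, upcycle=True))   # upcycled model is larger
print(total_params(100, 50, 8, upcycle=False))  # partitioned model is not
```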
Key Novelty
Post-Training Oriented MoE Construction (Attention & MLP)
Constructs 'Attention MoE' by grouping attention heads into experts (respecting Grouped Query Attention constraints) and 'MLP MoE' by partitioning neurons based on importance.
Introduces a 'Residual MLP MoE' variant where common knowledge is extracted into a shared expert while other neurons form routed experts.
Employs a two-stage post-training pipeline (General -> Math/Code) to recover the performance of the sparsified instructed model.
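One way to picture the GQA constraint on Attention MoE construction: query heads that share a key/value head form an indivisible group, so experts must be built from whole KV groups. The sketch below is a minimal illustration of that idea (function name and grouping scheme are assumptions, not the paper's exact procedure); LLaMA-3-8B uses 32 query heads and 8 KV heads.

```python
def group_heads_gqa(n_q_heads, n_kv_heads, n_experts):
    """Hypothetical sketch: assign query heads to attention experts so that
    every GQA group (all query heads sharing one KV head) stays intact
    inside a single expert. Assumes n_kv_heads is divisible by n_experts."""
    group_size = n_q_heads // n_kv_heads        # query heads per KV head
    groups_per_expert = n_kv_heads // n_experts # whole KV groups per expert
    experts = []
    for e in range(n_experts):
        heads = []
        for g in range(e * groups_per_expert, (e + 1) * groups_per_expert):
            heads.extend(range(g * group_size, (g + 1) * group_size))
        experts.append(heads)
    return experts

# LLaMA-3-8B-like shape: 32 query heads, 8 KV heads, split into 4 experts.
experts = group_heads_gqa(32, 8, 4)
print(experts[0])  # first expert: query heads 0..7 (KV groups 0 and 1)
```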
Architecture
The overall framework for constructing LLaMA-MoE v2. It illustrates the conversion of Dense Attention and MLP blocks into their MoE counterparts and the subsequent two-stage post-training pipeline.
Breakthrough Assessment
7/10
Proposes a novel, cheaper pathway to MoE models (sparsifying instructed models + post-training) and addresses Attention sparsity, which is often overlooked. Impact depends on the (missing) quantitative results.
⚙️ Technical Details
Problem Definition
Setting: Sparsifying a pre-trained dense Transformer model M_dense into a Mixture-of-Experts model M_MoE
Code and models available at https://github.com/OpenSparseLLMs/LLaMA-MoE-v2. The paper describes the specific partitioning logic (GQA constraints, importance scoring) needed to replicate the architecture construction.
📊 Experiments & Results
Evaluation Setup
Evaluation of instructed MoE models on diverse downstream tasks after sparsification and post-training.
Benchmarks:
General Conversation Tasks (Instruction Following)
Math Benchmarks (Mathematical Reasoning)
Code Benchmarks (Code Generation)
Metrics:
Not explicitly reported in the provided text
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
Constructing MoE models from instructed dense LLMs is viable but causes initial performance degradation due to parameter sparsity.
A two-stage post-training strategy (General -> Specialized+Replay) is effective for recovering model capabilities without continual pre-training.
Attention modules can be effectively sparsified by grouping heads into experts, provided GQA constraints are respected.
Partitioning neurons based on importance allows for the creation of efficient MLP experts (both standard and residual).
📚 Prerequisite Knowledge
Prerequisites
Transformer Architecture (Attention, MLP)
Mixture-of-Experts (MoE)
Grouped Query Attention (GQA)
Instruction Tuning
Key Terms
MoE: Mixture-of-Experts—a model architecture where only a subset of network components (experts) are activated for each input
GQA: Grouped Query Attention—an attention mechanism where multiple query heads share a single key/value head to save memory
Sparsity: The property of activating only a fraction of model parameters during inference
Residual MoE: An MoE variant containing a 'shared expert' that is always activated to capture common knowledge, alongside routed experts
Instruction Tuning: Fine-tuning a pre-trained model on datasets of instructions and responses to improve its ability to follow user commands
Load Balancing Loss: An auxiliary loss function used during training to ensure that the router network distributes tokens evenly among experts
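A common form of this auxiliary loss (the Switch-Transformer-style formulation; the paper may use a different variant) multiplies, per expert, the fraction of tokens routed to it by its mean router probability, then sums and scales by the expert count. A perfectly balanced router yields a loss of 1.0 under this formulation.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, n_experts):
    """Switch-style auxiliary load-balancing loss sketch.
    router_probs: (n_tokens, n_experts) softmax outputs of the router.
    expert_assignments: (n_tokens,) index of the expert each token was sent to.
    Minimized when tokens and probability mass are spread evenly."""
    frac_tokens = np.bincount(expert_assignments,
                              minlength=n_experts) / len(expert_assignments)
    mean_probs = router_probs.mean(axis=0)    # (n_experts,)
    return n_experts * np.sum(frac_tokens * mean_probs)

# Perfectly balanced case: uniform probs, equal token counts per expert.
probs = np.full((8, 4), 0.25)
assign = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balancing_loss(probs, assign, 4))  # -> 1.0
```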