
(FLAN) Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for LLMs

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Y. Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou
Google, University of California, Berkeley, Massachusetts Institute of Technology, University of Massachusetts Amherst, The University of Texas at Austin
arXiv, May 2023
Tags: Pretraining · Reasoning · QA · Benchmark

📝 Paper Summary

Instruction Tuning Sparse Mixture-of-Experts (MoE)
Combining instruction tuning with sparse Mixture-of-Experts (MoE) models allows for massive parameter scaling without increasing inference costs, enabling smaller MoE models to outperform much larger dense models.
Core Problem
Sparse MoE models often underperform dense models of equivalent computational cost when fine-tuned directly on downstream tasks, suffering from a mismatch between general pretraining and task-specific finetuning.
Why it matters:
  • Growing computational costs of dense LLMs limit their scalability and deployment
  • Previous attempts to use MoEs for task-specific finetuning yielded suboptimal results, often worse than dense baselines
  • Bridging the gap between pretraining and downstream performance is crucial for utilizing efficient sparse architectures
Concrete Example: When fine-tuned directly on a downstream task without instruction tuning, an MoE model can achieve lower accuracy than a dense T5 model using the same FLOPs. With instruction tuning first, the same MoE significantly outperforms its dense equivalent.
Key Novelty
Instruction-Tuned Sparse Mixture-of-Experts (Flan-MoE)
  • Combines the parameter-efficiency of sparse MoE architectures (like Switch Transformer) with the generalization capabilities of instruction tuning (like Flan)
  • Demonstrates that instruction tuning is the 'missing link' that unlocks the potential of MoE models, allowing them to surpass dense models in zero-shot and few-shot settings
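To make the architecture side concrete, the sketch below shows Switch-style top-1 expert routing: a learned router scores each token against every expert, and each token is processed by only its single highest-scoring expert, which is why parameter count can grow while per-token FLOPs stay roughly constant. This is a minimal NumPy illustration, not the paper's implementation; the function and argument names are hypothetical.

```python
import numpy as np

def switch_route(tokens, gate_w, experts):
    """Top-1 (Switch-style) MoE routing: each token is sent to one expert.

    tokens:  (n_tokens, d_model) activations entering the MoE layer
    gate_w:  (d_model, n_experts) router weight matrix (learned)
    experts: list of callables, one feed-forward network per expert

    All names here are illustrative, not from the paper's codebase.
    """
    logits = tokens @ gate_w                        # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over experts
    choice = probs.argmax(axis=-1)                  # top-1 expert per token

    out = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Scale each expert output by its gate probability so the
            # router itself receives a gradient signal during training.
            out[mask] = expert(tokens[mask]) * probs[mask, e:e + 1]
    return out

# Tiny usage example: 4 tokens, d_model=8, 2 toy "experts".
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 2))
experts = [lambda x: x * 2.0, lambda x: x * -1.0]
out = switch_route(tokens, gate_w, experts)
```

Because only one expert runs per token, doubling the number of experts doubles parameters but leaves the per-token compute essentially unchanged; the cost of this sparsity is the training fragility that instruction tuning, per the paper, helps fix.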
Evaluation Highlights
  • Flan-ST 32B surpasses Flan-PaLM 62B on four benchmark tasks while using only about a third of the FLOPs per token
  • Instruction tuning boosts MoE performance on MMLU by up to 45.2% (for ST 32B), versus only 6.6% for the dense Flan-PaLM 62B
  • On zero-shot and few-shot MMLU-Direct, Flan-MoE provides absolute performance improvements of 7.1% on average over dense baselines at the same compute cost
Breakthrough Assessment
8/10
Provides compelling evidence that instruction tuning fixes the fragility of MoE fine-tuning, establishing a new Pareto frontier for efficient large-scale language modeling.