In recent years, the evolution of AI has been marked by an immense shift towards scaling language models (LLMs), characterized by exponential growth in data, parameter counts, and model size. This transformation has yielded remarkable advances in NLP, enabling more complex tasks, enhanced reasoning abilities, and the use of far less labeled data. Notable milestones include renowned models such as GPT-2, BERT, T5, GPT-3, FLAN, Gopher, Chinchilla, and PaLM.
While current language models continue to demonstrate strong effectiveness and high evaluation scores, the rising costs and energy demands associated with further scaling are becoming overwhelmingly burdensome (Carbon Emissions and Large Neural Network Training, Patterson et al., 2021).
In response to this growth in consumption, alternative architectures, most notably sparse expert models, an idea some thirty years old that is now re-emerging, have become an active area of research and experimentation in large-scale deep learning. Various studies focusing on sparse expert models have demonstrated their capacity to yield new performance improvements by exploiting neural network sparsity, in conjunction with memory-efficient training and inference techniques, as an alternative to conventional densely connected models. Ultimately, these approaches have shown promise in reducing computational and energy requirements.
These sparse expert models include architectures such as Mixture-of-Experts, Switch Transformers, Routing Networks, Hash layers, and BASE layers. The common thread among them is that each individual input is processed by only a specific subset of the parameters.
Many modern sparse expert models draw inspiration from the introduction of a new general-purpose neural network component: the Sparsely-Gated Mixture-of-Experts (MoE) layer, which comprises several experts, each a simple feed-forward neural network. The concept was introduced in ‘Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer’ by Shazeer et al. (2017) at Google, with Geoffrey Hinton among the co-authors. For the first two decades of MoE research, this comprehensive survey is a useful starting point: Seniha Esen Yuksel, Joseph N. Wilson, and Paul D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012.
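As a concrete illustration, here is a minimal sketch of such a sparsely-gated MoE layer in PyTorch. It is not the authors' implementation: the layer sizes, the simple linear router, and the top-k gating loop are illustrative assumptions, and the noisy gating and load-balancing terms of Shazeer et al. (2017) are omitted.

```python
# Minimal sketch of a Sparsely-Gated MoE layer (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a simple two-layer feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)   # router
        self.top_k = top_k

    def forward(self, x):                      # x: (num_tokens, d_model)
        logits = self.gate(x)                  # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Only the top-k experts run for each token; the rest stay idle.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                  # 16 tokens of width 512
print(SparseMoE()(tokens).shape)               # torch.Size([16, 512])
```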
The idea of a mixture-of-experts was established roughly three decades earlier, building upon the work of Jacobs et al. (1991) and Jordan and Jacobs (1994). In those early formulations, each expert was an entire neural network, and the MoE more closely resembled an ensemble method. It was not until the work of Shazeer et al. (2017), however, that the first large-scale success with this approach was achieved.
A promising alternative that enables models to scale in size without incurring their full computational cost is the use of sparse mixtures of experts.
This concept and its variants, of which the sparse expert model is the most common, harness the advantages of neural network sparsity: the network allocates different subsets of its weights to different inputs, resulting in significantly larger models with smaller computational footprints.
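To make the "larger model, smaller footprint" trade-off concrete, the rough back-of-the-envelope calculation below compares a dense FFN with a sparse MoE FFN that routes each token to its top-2 of 64 experts. All sizes are illustrative assumptions rather than figures from any particular paper, and the small cost of the router itself is ignored.

```python
# Illustrative comparison: dense FFN vs. sparse MoE FFN with top-2 routing.
d_model, d_hidden, num_experts, top_k = 512, 2048, 64, 2

dense_params = 2 * d_model * d_hidden                # W_in + W_out
moe_params = num_experts * dense_params              # 64x more weights...

dense_flops_per_token = 2 * dense_params             # one multiply-add per weight
moe_flops_per_token = top_k * dense_flops_per_token  # ...but only 2 experts run

print(f"params:      dense {dense_params:,}  vs  MoE {moe_params:,}")
print(f"FLOPs/token: dense {dense_flops_per_token:,}  vs  MoE {moe_flops_per_token:,}")
# params:      dense 2,097,152  vs  MoE 134,217,728
# FLOPs/token: dense 4,194,304  vs  MoE 8,388,608
```

Under these assumed sizes, the MoE layer holds 64 times as many parameters as the dense layer while costing only twice as much compute per token.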
This approach has gained popularity within the NLP domain and has proven particularly well-suited to larger models, leveraging the abundance of billions or trillions of tokens available for tasks such as next-word prediction and masked language modeling, and even for vision-related tasks.
Moreover, the popularity of sparse expert models grew significantly once they were integrated with the near de facto standard Transformer architecture. Within Transformer models, MoE layers typically stand in for the FFN layer that follows the multi-headed attention mechanism in each Transformer block, with a router selecting which experts process each token. Building on earlier successes, GShard spurred research on combining MoE with Transformer models, and more recent systems have gone further with improved training and deployment of MoE models.
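The sketch below shows, under the same illustrative assumptions, how such an MoE layer can slot into a Transformer block in place of the dense FFN that follows multi-headed attention. It reuses the hypothetical SparseMoE class from the earlier sketch; this is a simplified single block, not GShard's or any other system's actual implementation.

```python
# Sketch: a Transformer block whose position-wise FFN is a sparse MoE layer.
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe_ffn = SparseMoE(d_model=d_model)    # replaces the dense FFN

    def forward(self, x):                            # x: (batch, seq, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                        # attention + residual
        b, s, d = x.shape
        # Route tokens independently through the sparse expert FFN.
        y = self.moe_ffn(x.reshape(b * s, d)).reshape(b, s, d)
        return self.norm2(x + y)                     # MoE FFN + residual

x = torch.randn(2, 10, 512)
print(MoETransformerBlock()(x).shape)                # torch.Size([2, 10, 512])
```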
Below, a comparison of a dense and a sparse model within the Transformer architecture can be seen: