Highlights

  • Mixture of Experts is a technique in AI where a set of specialized models (experts) is collectively orchestrated by a gating mechanism that routes different parts of the input space to the right expert, optimizing for performance and efficiency. It leverages the fact that an ensemble of weaker language models, each specializing in specific tasks, can produce more accurate results, much like traditional ML ensemble methods. However, it adds a new ingredient: dynamic routing of the input during generation. In this blog post, I will explain how OpenAI leveraged it to effectively combine eight different models under what is called GPT-4, and how Mixtral’s architecture made this method even more efficient. (View Highlight)
  • Here’s a surprising revelation: to build an LLM application, you will, of course, need an LLM. However, when you break down the functionalities of your LLM app, you’ll find that, like many other applications, different components are designed to serve very distinct purposes. Some components may be tasked with retrieving relevant data from a database, others might be engineered to generate a “chat” experience, and some could be responsible for formatting or summarization. Similar to traditional machine learning, where ensemble techniques like boosting and bagging combine different models, Mixture of Experts in LLMs leverages a set of transformer models that are trained differently, and it learns to weight them differently to build a complex inference pipeline. (View Highlight)
  • In the context of LLMs, the concept of ‘expertise’ takes a unique form. Each model, or ‘expert,’ naturally develops a proficiency in different topics as it undergoes the training process. In this setup, the role of a ‘coordinator,’ which in a human context might be a person overseeing a team, is played by a Gating Network. This network has the crucial task of directing inputs to the appropriate models based on the topic at hand. Over time, the Gating Network improves its understanding of each model’s strengths and fine-tunes its routing decisions accordingly. (View Highlight)
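
To make the gating idea concrete, here is a minimal, illustrative sketch of a dense Mixture-of-Experts layer in PyTorch: a small gating network scores each expert per token, and the expert outputs are combined with those weights. The names (`Expert`, `MoELayer`) are hypothetical and not taken from any specific library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward block standing in for one 'expert'."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Dense (soft) mixture: every expert runs and the gating network weights their outputs."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)  # the 'coordinator' from the text

    def forward(self, x):                                         # x: (batch, seq, d_model)
        weights = F.softmax(self.gate(x), dim=-1)                 # (batch, seq, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, seq, d_model, n_experts)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)         # weighted sum over experts

x = torch.randn(2, 16, 64)                       # 2 sequences, 16 tokens, model dim 64
layer = MoELayer(d_model=64, d_hidden=256, n_experts=8)
print(layer(x).shape)                            # torch.Size([2, 16, 64])
```

In this toy version every expert runs on every token; the sparse variants discussed below (GPT-4’s rumored setup, Mixtral) only execute the experts the router selects, which is where the efficiency gains come from.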
  • It’s important to clarify, however, that despite the use of the term ‘expert,’ these LLM models don’t possess expertise in the way we typically think of human specialists in fields of science or arts. Their ‘expertise’ resides in a complex, high-dimensional embedding space. The alignment of this expertise with our conventional, human-centric understanding of subjects can vary. The notion of categorizing these models into different domains of expertise is more of a conceptual tool to help us understand and navigate their diverse capabilities within the AI framework. (View Highlight)
  • In traditional models, all tasks are processed by a single, dense neural network, akin to a generalist handling every problem. For complex problems, however, it becomes hard to find a generalist model capable of handling everything, which is why Mixture of Experts LLMs are so valuable. (View Highlight)
  • On June 20th, George Hotz, the founder of self-driving startup Comma.ai, revealed that GPT-4 is not a single massive model, but rather a combination of 8 smaller models, each consisting of 220 billion parameters. This leak was later confirmed by Soumith Chintala, co-creator of PyTorch at Meta. (View Highlight)
  • For context, GPT-3.5 has around 175B parameters. However, just as we will cover for Mixtral, calculating the total number of parameters in an MoE is not so direct, since only the FFN (feed-forward network) layers are replicated per expert, while the other layers can be shared by all. This may significantly decrease the total number of parameters of GPT-4. Regardless, the total should be somewhere between 1.2 and 1.7 trillion parameters. (View Highlight)
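
As a back-of-the-envelope illustration of that range, here is a sketch starting from the rumored 8 × 220B figure; the fraction of parameters shared across experts (attention, embeddings) is an assumption, since OpenAI has not published any of these numbers.

```python
# Rough estimate only: both inputs below are rumors or guesses, not confirmed figures.
n_experts = 8
params_per_expert = 220e9      # rumored size of each GPT-4 expert
shared_fraction = 1 / 3        # assumed share of non-FFN params (attention, embeddings) shared by all

shared = params_per_expert * shared_fraction                      # counted once
expert_specific = params_per_expert * (1 - shared_fraction) * n_experts
total = shared + expert_specific
print(f"~{total / 1e12:.2f}T parameters")                         # ~1.25T with these assumptions
```

Varying the shared fraction between 0 and 1/3 moves the estimate across roughly the 1.2–1.7 trillion range quoted above.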
  • Recent reports of degradation in GPT-4’s answer quality and increased laziness may be directly connected to the fact that it is an MoE. Since OpenAI has been so focused on driving inference costs down, while also decreasing the price per token for the user, they may be using fewer or smaller experts to build GPT-4. (View Highlight)
  • Since each expert needs to be loaded into VRAM, occupying GPU memory even when only some of its layers are used at each step, the hardware requirements are immense. That is why a small reduction in the experts’ size or number can have a big impact on costs, although performance may be affected as well. (View Highlight)
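
A rough calculation illustrates the memory pressure, assuming fp16/bf16 weights (2 bytes per parameter) and ignoring activations and the KV cache:

```python
# Illustrative only: weights-only footprint of a Mixtral-sized MoE in half precision.
total_params = 46.7e9        # all experts must be resident, e.g. a Mixtral-sized model
bytes_per_param = 2          # fp16/bf16
vram_gb = total_params * bytes_per_param / 1e9
print(f"~{vram_gb:.0f} GB just for weights")   # ~93 GB, even though only ~12.9B params run per token
```

A rumored GPT-4-scale MoE would multiply this footprint many times over, which is why trimming expert size or count is such a tempting cost lever.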
  • The reigning theory is that this cost reduction, combined with more aggressive RLHF (Reinforcement Learning from Human Feedback), is causing the degradation of user experience and answer quality. The focus on RLHF mainly makes GPT-4 more robust and useful for the company’s products, but less interesting for the everyday ChatGPT user. (View Highlight)
  • That is the problem with the lack of transparency on their side: we do not know what we are getting! We can only get some insight through leaks that have happened and might happen. (View Highlight)
  • Mixtral outperforms many large models while remaining efficient at inference. It employs a routing layer that decides which expert, or combination of experts, to use for each token, optimizing resource usage. It has 46.7B parameters in total but uses only about 12.9B per token. (View Highlight)
  • Despite its impressive capabilities, Mixtral faces challenges like any other MoE model, particularly in training and data management. (View Highlight)
  • Mixtral is a sparse mixture-of-experts (SMoE) network. At its core, it’s a decoder-only model, a design choice that differentiates it from models that include both encoder and decoder. (View Highlight)
  • The magic of Mixtral lies in how it handles its feedforward block. Here’s where the ‘experts’ come into play. Mixtral doesn’t rely on a single set of parameters; instead, it picks from eight distinct groups of parameters. This selection is dynamic and context-dependent. (View Highlight)
  • Token Routing: For every token in the input, a router network chooses two of the eight experts. This dual selection allows for nuanced, context-rich processing of information (a minimal sketch follows the next point). (View Highlight)
  • Additive Output Combination: The outputs from these chosen experts are then combined additively, ensuring a rich blend of specialized knowledge. (View Highlight)
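
Here is a minimal sketch of this top-2 routing with additive (weighted) combination, written in PyTorch for illustration; it is not Mixtral’s actual implementation, and the module and variable names are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    """Toy Mixtral-style block: route each token to 2 of 8 expert FFNs and add the results."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                 # x: (n_tokens, d_model)
        logits = self.router(x)                           # (n_tokens, n_experts)
        top_w, top_idx = logits.topk(self.top_k, dim=-1)  # keep only the 2 best experts per token
        top_w = F.softmax(top_w, dim=-1)                  # renormalize weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                    # only the selected experts are executed
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e              # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                              # 10 tokens, model dim 64
block = SparseMoEBlock(d_model=64, d_hidden=256)
print(block(tokens).shape)                                # torch.Size([10, 64])
```

The key property is that each token only ever touches 2 of the 8 expert FFNs, even though all 8 sets of weights exist in the model.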
  • One might assume that having multiple experts would exponentially increase the parameter count. However, Mixtral balances this with efficiency:
    • Total Parameters: Mixtral boasts a total of 46.7 billion parameters. But the efficient use of these parameters is what sets it apart.
    • Parameters per Token: It uses only about 12.9 billion parameters per token. This ingenious approach means that Mixtral operates with the speed and cost of a 12.9 billion parameter model, despite its larger size. (View Highlight)
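
The arithmetic behind “46.7B total, about 12.9B per token” can be sketched as below; the split between shared parameters (attention, embeddings) and per-expert FFN parameters is an approximation for illustration, not the exact layer-by-layer count from the paper.

```python
# Approximate reconstruction of the headline numbers; the shared-parameter figure is assumed.
n_experts, active_experts = 8, 2
total_params = 46.7e9
shared_params = 1.6e9                             # assumed attention/embedding params, counted once
per_expert = (total_params - shared_params) / n_experts
active = shared_params + active_experts * per_expert
print(f"~{active / 1e9:.1f}B parameters used per token")   # ~12.9B with these assumptions
```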
  • Mixtral is not just better in raw output quality but also in inference speed, which is about six times faster than Llama 2 70B. In the paper presenting Mixtral, the authors also provide comparisons with other models, as seen below. (View Highlight)
  • By the way, if you have seen Mixtral_34Bx2_MoE_60B and other variants like Mixtral_11Bx2_MoE_19B getting amazing results, just remember that despite having Mixtral in the name, they are not Mistral-based. Rather, they are Yi-based, so they specialize in English and Chinese output. (View Highlight)
    • Inference Speed: Despite their size, they offer faster inference, using only a fraction of their parameters at any given time. (View Highlight)
    • Lower Costs: Compared to a dense model with the same total number of parameters, MoE models are much cheaper to train and run inference on, due to the previous two points. (View Highlight)
    • Fine-tuning Difficulties: Historically, MoEs struggled with fine-tuning and were prone to overfitting, although many recent advancements have made it easier. (View Highlight)