Highlights

  • As an organization, building a multitude of models via fine-tuning makes sense for multiple reasons.
  • Performance - There is compelling evidence that smaller, specialized models outperform their larger, general-purpose counterparts on the tasks that they were trained on. Predibase [5] showed that you can get better performance than GPT-4 using task-specific LoRAs with a base like mistralai/Mistral-7B-v0.1.
  • Adaptability - Models like Mistral or Llama are extremely versatile. You can pick one of them as your base model and build many specialized models, even when the downstream tasks are very different. Also, note that you aren’t locked in: you can easily swap the base and fine-tune your data on another one (more on this later).
  • Independence - For each task that your organization cares about, different teams can work on different fine-tunes, allowing for independence in data preparation, configurations, evaluation criteria, and cadence of model updates.
  • Privacy - Specialized models offer flexibility with training data segregation and access restrictions to different users based on data privacy requirements. Additionally, in cases where running models locally is important, a small model can be made highly capable for a specific task while keeping its size small enough to run on device.
  • In summary, fine-tuning enables organizations to unlock the value of their data, and this advantage becomes especially significant, even game-changing, when organizations use highly specialized data that is uniquely theirs.
  • So, where is the catch? Deploying and serving Large Language Models (LLMs) is challenging in many ways. Cost and operational complexity are key considerations when deploying a single model, let alone n models. This means that, for all its glory, fine-tuning complicates LLM deployment and serving even further.
  • Now that we understand the basic idea of model adaptation introduced by LoRA, we are ready to delve into multi-LoRA serving. The concept is simple: given one base pre-trained model and many different tasks for which you have fine-tuned specific LoRAs, multi-LoRA serving is a mechanism to dynamically pick the desired LoRA based on the incoming request.
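To make the request-level picture concrete, here is a minimal sketch of querying a multi-LoRA TGI deployment and choosing an adapter per request. The endpoint URL is a placeholder for your own deployment, the `adapter_id` request parameter follows TGI's multi-LoRA serving docs, and the two adapter ids are the Predibase ones discussed in this post.

```python
# Minimal sketch: one TGI deployment, adapter selected per request.
import requests

TGI_URL = "http://127.0.0.1:8080/generate"  # placeholder for your own deployment


def generate(prompt: str, adapter_id: str | None = None) -> str:
    """Send a generation request, optionally routing it to a specific LoRA."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 64},
    }
    if adapter_id is not None:
        # Pick the adapter per request; omit it to query the plain base model.
        payload["parameters"]["adapter_id"] = adapter_id
    response = requests.post(TGI_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["generated_text"]


# One deployment, different "models" per call:
print(generate("Write a Python function that merges two sorted lists.",
               adapter_id="predibase/magicoder"))
print(generate("A customer says their order arrived damaged. Draft a reply.",
               adapter_id="predibase/customer_support"))
```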
  • Multi-LoRA serving enables you to deploy the base model just once. And since the LoRA adapters are small, you can load many adapters. Note that the exact number will depend on your available GPU resources and what model you deploy. What you end up with is effectively equivalent to having multiple fine-tuned models in a single deployment.
  • LoRAs (the adapter weights) can vary based on rank and quantization, but they are generally quite tiny. Let’s get a quick intuition of how small these adapters are: predibase/magicoder is 13.6MB, which is less than 1/1000th the size of mistralai/Mistral-7B-v0.1, which is 14.48GB. In relative terms, loading 30 adapters results in only a ~3% increase in VRAM. Ultimately, this is not an issue for most deployments. Hence, we can have one deployment for many models.
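A quick back-of-the-envelope check of that ~3% figure, using the sizes quoted above:

```python
# Adapter overhead relative to the base model weights.
adapter_mb = 13.6   # predibase/magicoder adapter size
base_gb = 14.48     # mistralai/Mistral-7B-v0.1 weights
n_adapters = 30

total_adapter_gb = n_adapters * adapter_mb / 1024
overhead = total_adapter_gb / base_gb
print(f"{total_adapter_gb:.2f} GB of adapters -> {overhead:.1%} of the base weights")
# ~0.40 GB -> ~2.8%, i.e. roughly the 3% increase mentioned above
```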
  • First, you need to train your LoRA models and export the adapters. You can find a guide on fine-tuning LoRA adapters here. Do note that when you push your fine-tuned model to the Hub, you only need to push the adapter, not the full merged model. When loading a LoRA adapter from the Hub, the base model is inferred from the adapter model card and loaded separately. For deeper support, please check out our Expert Support Program. The real value will come when you create your own LoRAs for your specific use cases.
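As a minimal sketch of the "push only the adapter" workflow (assuming transformers and peft are installed; the LoRA hyperparameters and repo id are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Wrap the base model with a LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# ... fine-tune `model` on your task-specific data here ...

# Push only the adapter (a few tens of MB). The adapter config records the base
# model id, so the base can be resolved and loaded separately at serving time.
model.push_to_hub("my-org/my-task-lora")  # hypothetical repo id
```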
  • For some organizations, it may be hard to train one LoRA for every use case, as they may lack the expertise or other resources. Even after you choose a base and prepare your data, you will need to keep up with the latest techniques, explore hyperparameters, find optimal hardware resources, write the code, and then evaluate. This can be quite a task, even for experienced teams.
  • AutoTrain can lower this barrier to entry significantly. AutoTrain is a no-code solution that allows you to train machine learning models in just a few clicks. There are a number of ways to use AutoTrain: in addition to running it locally/on-prem, there are several hosted options.
  • predibase/customer_support is trained on the Gridspace-Stanford Harper Valley speech dataset, which enhances its ability to understand and respond to customer service interactions accurately. This improves the model’s performance in tasks such as speech recognition, emotion detection, and dialogue management, leading to more efficient and empathetic customer support.
  • predibase/magicoder is trained on ise-uiuc/Magicoder-OSS-Instruct-75K, a synthetically generated code-instruction dataset.
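If you want to inspect one of these adapters locally before serving it, a minimal sketch with peft looks like this (the base repo id is the one used throughout this post):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# The adapter download is only a few MB; everything else comes from the base.
model = PeftModel.from_pretrained(base, "predibase/magicoder")
```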
  • Inference Endpoints lets you deploy any Hugging Face model on many GPU and alternative hardware types across AWS, GCP, and Azure, all in a few clicks! In the GUI, it’s easy to deploy. Under the hood, we use TGI by default for text generation (though you have the option to use any image you choose).
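For a programmatic route, here is a hedged sketch using huggingface_hub's create_inference_endpoint. The instance, region, and image settings are placeholder values to adapt to your account; the LORA_ADAPTERS environment variable is how the TGI image is told which adapters to load next to the base model.

```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    name="mistral-multi-lora",                # hypothetical endpoint name
    repository="mistralai/Mistral-7B-v0.1",   # the shared base model
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                             # placeholder cloud/instance choices
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            # Comma-separated list of LoRA adapters to preload.
            "LORA_ADAPTERS": "predibase/customer_support,predibase/magicoder",
        },
    },
)
endpoint.wait()      # block until the endpoint is running
print(endpoint.url)
```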
  • We are not the first to climb this summit, as discussed below. The team behind LoRAX, Predibase, has an excellent write-up. Do check it out, as this section is based on their work.
  • One of the big benefits of Multi-LoRA serving is that you don’t need multiple deployments for multiple models, which is ultimately much, much cheaper. This should match your intuition, as each separate model needs a full copy of the weights, not just the small adapter layers. As you can see in Figure 5, even when we add many more models with TGI Multi-LoRA, the cost per token stays the same. The cost for dedicated TGI deployments scales linearly, as you need a new deployment for each fine-tuned model.
  • One real-world challenge when you deploy multiple models is that you will have strong variance in your usage patterns. Some models might have low usage; some might be bursty, and some might be high frequency. This makes it really hard to scale, especially when each model is independent. There are a lot of “rounding” errors when you have to add another GPU, and that adds up fast. In an ideal world, you would maximize the utilization of each GPU and not provision any extra. You need to make sure you have access to enough GPUs, knowing some will be idle, which can be quite tedious.
  • When we consolidate with Multi-LoRA, we get much more stable usage. We can see the results of this in Figure 6, where the Multi-LoRA serving pattern is quite stable even though it is built from more volatile individual patterns. By consolidating the models, you allow much smoother usage and more manageable scaling. Do note that these are just illustrative patterns, but think through your own patterns and how Multi-LoRA can help. Scale 1 model, not 30!
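A tiny synthetic simulation (illustrative numbers only, not the data behind Figure 6) shows why consolidation smooths things out: the relative volatility of the aggregate load across many independent, bursty models is much lower than that of any single model.

```python
import numpy as np

rng = np.random.default_rng(0)
hours, n_models = 24 * 7, 30

# Each model gets its own bursty hourly request pattern (Poisson, random scale).
per_model = rng.poisson(lam=rng.uniform(1, 50, size=n_models), size=(hours, n_models))
aggregate = per_model.sum(axis=1)


def volatility(x):
    """Coefficient of variation: std / mean, i.e. relative volatility."""
    return x.std() / x.mean()


print("mean per-model volatility:", np.mean([volatility(per_model[:, i]) for i in range(n_models)]))
print("aggregate volatility:     ", volatility(aggregate))
# The aggregate is far smoother, so one consolidated deployment scales more
# gracefully than 30 independent ones.
```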
  • What happens in the real world with AI moving at breakneck speeds? What if you want to choose a different/newer model as your base? While our examples use mistralai/Mistral-7B-v0.1 as a base model, there are other bases like Mistral v0.3, which supports function calling, and altogether different model families like Llama 3. In general, we expect new base models that are more efficient and more performant to come out all the time.
  • But worry not! It is easy enough to re-train the LoRAs if you have a compelling reason to update your base model. Training is relatively cheap; in fact, Predibase found it costs only ~$8.00 to train each one. The amount of code change is minimal with modern frameworks and common engineering practices:
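For instance, with a peft-style training script, swapping the base is essentially a one-line change (a hedged sketch; the repo ids are the ones mentioned above, and the rest of the pipeline is elided):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# base_model_id = "mistralai/Mistral-7B-v0.1"   # previous base
base_model_id = "mistralai/Mistral-7B-v0.3"     # new base: a one-line change

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base_model_id),
    LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
# ... the data preparation, training loop, and evaluation stay the same ...
```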
  • Multi-LoRA serving represents a transformative approach to deploying AI models, providing a solution to the cost and complexity barriers associated with managing multiple specialized models. By leveraging a single base model and dynamically applying fine-tuned adapters, organizations can significantly reduce operational overhead while maintaining or even enhancing performance across diverse tasks. AI Directors, we ask you to be bold: choose a base model and embrace the Multi-LoRA paradigm; the simplicity and cost savings will pay dividends. Let Multi-LoRA be the cornerstone of your AI strategy, ensuring your organization stays ahead in the rapidly evolving landscape of technology.
  • Implementing Multi-LoRA serving can be really tricky, but thanks to the excellent work by punica-ai and the lorax team, optimized kernels and frameworks have been developed to make this process more efficient. TGI leverages these optimizations to provide fast and efficient inference with multiple LoRA models.