LoRA: Low-Rank Adaptation of Large Language Models
New highlights added October 23, 2023 at 2:13 PM
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible.
We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.
Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as the original model.
Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed.
We take inspiration from Li et al. (2018a) and Aghajanyan et al. (2020), which show that learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1.
New highlights added October 24, 2023 at 9:18 AM
Suppose we are given a pre-trained autoregressive language model P_Φ(y|x) parametrized by Φ. For instance, P_Φ(y|x) can be a generic multi-task learner such as GPT (Radford et al., b; Brown et al., 2020) based on the Transformer architecture (Vaswani et al., 2017).
During full fine-tuning, the model is initialized to pre-trained weights Φ0 and updated to Φ0 + ∆Φ by repeatedly following the gradient to maximize the conditional language modeling objective:

$$\max_{\Phi} \sum_{(x,y)\in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi}(y_t \mid x, y_{<t})\right) \quad (1)$$
One of the main drawbacks of full fine-tuning is that for each downstream task we learn a different set of parameters ∆Φ whose dimension |∆Φ| equals |Φ0|. Thus, if the pre-trained model is large (such as GPT-3 with |Φ0| ≈ 175 billion parameters), storing and deploying many independent instances of fine-tuned models can be challenging, if at all feasible.
The task-specific parameter increment ∆Φ = ∆Φ(Θ) is further encoded by a much smaller-sized set of parameters Θ with |Θ| ≪ |Φ0|. The task of finding ∆Φ thus becomes optimizing over Θ:

$$\max_{\Theta} \sum_{(x,y)\in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(p_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\right) \quad (2)$$
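As a concrete illustration of this setup, here is a minimal PyTorch-style sketch in which the pre-trained weights Φ0 are frozen and only a small set of task-specific parameters Θ is handed to the optimizer. The `model(input_ids=..., labels=...).loss` interface and the helper names are assumptions for illustration, not part of the paper.

```python
import torch

# Hypothetical handles for illustration: `model` is a pre-trained causal LM with
# weights Phi_0, and `theta_params` is the small set of task-specific parameters Theta.
def configure_optimizer(model, theta_params, lr=1e-4):
    for p in model.parameters():          # freeze Phi_0: no gradients reach pre-trained weights
        p.requires_grad = False
    for p in theta_params:                # only Theta, with |Theta| << |Phi_0|, is trained
        p.requires_grad = True
    return torch.optim.Adam(theta_params, lr=lr)

def training_step(model, optimizer, input_ids, labels):
    # Maximizing Eq. (2) is equivalent to minimizing the causal-LM cross-entropy
    # of p_{Phi_0 + DeltaPhi(Theta)}(y_t | x, y_<t) over the target tokens.
    loss = model(input_ids=input_ids, labels=labels).loss  # HuggingFace-style API assumed
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```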
Since the inception of transfer learning, dozens of works have sought to make model adaptation more parameter- and compute-efficient.
Using language modeling as an example, there are two prominent strategies when it comes to efficient adaptation: adding adapter layers (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rückle et al., 2020) or optimizing some forms of the input layer activations (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021). However, both strategies have their limitations, especially in a large-scale and latency-sensitive production scenario.
Adapter Layers Introduce Inference Latency. There are many variants of adapters. We focus on the original design by Houlsby et al. (2019), which has two adapter layers per Transformer block, and a more recent one by Lin et al. (2020), which has only one per block but with an additional LayerNorm (Ba et al., 2016). While one can reduce the overall latency by pruning layers or exploiting multi-task settings (Rückle et al., 2020; Pfeiffer et al., 2021), there is no direct way to bypass the extra compute in adapter layers. This seems like a non-issue since adapter layers are designed to have few parameters (sometimes <1% of the original model) by having a small bottleneck dimension, which limits the FLOPs they can add. However, large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially. This makes a difference in the online inference setting where the batch size is typically as small as one. In a generic scenario without model parallelism, such as running inference on GPT-2 medium (Radford et al., b) on a single GPU, we see a noticeable increase in latency when using adapters, even with a very small bottleneck dimension.
This problem gets worse when we need to shard the model as done in Shoeybi et al. (2020) and Lepikhin et al. (2020), because the additional depth requires more synchronous GPU operations such as AllReduce and Broadcast, unless we store the adapter parameters redundantly many times.
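For reference, here is a minimal sketch of the Houlsby-style bottleneck adapter discussed above, assuming a GELU nonlinearity and an illustrative bottleneck size. It shows why the extra projections sit sequentially on the forward path even though they add few parameters.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: a small bottleneck MLP with a residual connection.

    Even with a tiny bottleneck (few parameters and FLOPs), the two extra matrix
    multiplications run sequentially after the sublayer they follow, which is
    what adds latency when the batch size is as small as one.
    """
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # down-projection to the bottleneck
        self.up = nn.Linear(bottleneck, d_model)    # up-projection back to the model width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual around the bottleneck

# Illustrative instantiation at GPT-2 medium width (d_model = 1024).
adapter = BottleneckAdapter(d_model=1024)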
Directly Optimizing the Prompt is Hard. The other direction, as exemplified by prefix tuning (Li & Liang, 2021), faces a different challenge. We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in the number of trainable parameters, confirming similar observations in the original paper.
A neural network contains many dense layers which perform matrix multiplication. The weight matrices in these layers typically have full rank.
When adapting to a specific task, Aghajanyan et al. (2020) show that pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a random projection to a smaller subspace. Inspired by this, we hypothesize that the updates to the weights also have a low “intrinsic rank” during adaptation. For a pre-trained weight matrix W0 ∈ R^{d×k}, we constrain its update by representing the latter with a low-rank decomposition W0 + ∆W = W0 + BA, where B ∈ R^{d×r}, A ∈ R^{r×k}, and the rank r ≪ min(d, k). During training, W0 is frozen and does not receive gradient updates, while A and B contain trainable parameters.
We use a random Gaussian initialization for A and zero for B, so ∆W = BA is zero at the beginning of training. We then scale ∆Wx by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if we scale the initialization appropriately. As a result, we simply set α to the first r we try and do not tune it. This scaling helps to reduce the need to retune hyperparameters when we vary r.
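A minimal sketch of a LoRA-augmented linear layer following this description (frozen W0, Gaussian-initialized A, zero-initialized B, and the α/r scaling). The initialization scale and the absence of dropout are simplifications for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA, scaled by alpha/r."""
    def __init__(self, d: int, k: int, r: int = 4, alpha: float = 4.0):
        super().__init__()
        # W0 is frozen; the random values here stand in for weights loaded from a checkpoint.
        self.W0 = nn.Parameter(torch.randn(d, k) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # random Gaussian init (scale illustrative)
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so BA = 0 at the start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., k) -> (..., d); h = W0 x + (alpha/r) * B A x
        return x @ self.W0.T + self.scaling * (x @ self.A.T @ self.B.T)
```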
A more general form of fine-tuning allows the training of a subset of the pre-trained parameters. LoRA takes a step further and does not require the accumulated gradient update to weight matrices to have full rank during adaptation. This means that when applying LoRA to all weight matrices and training all biases, we roughly recover the expressiveness of full fine-tuning by setting the LoRA rank r to the rank of the pre-trained weight matrices.
No Additional Inference Latency. When deployed in production, we can explicitly compute and store W = W0 + BA and perform inference as usual. Note that both W0 and BA are in R^{d×k}. When we need to switch to another downstream task, we can recover W0 by subtracting BA and then adding a different B′A′, a quick operation with very little memory overhead.
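A sketch of the merge/unmerge bookkeeping described here, reusing the fields of the hypothetical `LoRALinear` from the earlier sketch. In a merged deployment the forward pass would use only W0 and skip the BA branch, so no extra latency is incurred.

```python
import torch

@torch.no_grad()
def merge(layer) -> None:
    """Deploy: fold the update into the weight, W = W0 + (alpha/r) * BA."""
    layer.W0 += layer.scaling * (layer.B @ layer.A)   # inference then needs a single matmul

@torch.no_grad()
def unmerge(layer) -> None:
    """Recover W0 by subtracting BA, e.g. before swapping in another task's A and B."""
    layer.W0 -= layer.scaling * (layer.B @ layer.A)
```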
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module (Wq, Wk, Wv, Wo) and two in the MLP module.
We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks), both for simplicity and parameter-efficiency.
The most significant benefit comes from the reduction in memory and storage usage. For a large Transformer trained with Adam, we reduce the VRAM usage by up to 2/3 if r ≪ d_model, as we do not need to store the optimizer states for the frozen parameters. On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to 350GB. With r = 4 and only the query and value projection matrices being adapted, the checkpoint size is reduced by roughly 10,000× (from 350GB to 35MB). This allows us to train with significantly fewer GPUs and avoid I/O bottlenecks. Another benefit is that we can switch between tasks while deployed at a much lower cost by only swapping the LoRA weights as opposed to all the parameters.
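A back-of-the-envelope check of the checkpoint-size claim, assuming GPT-3 175B's published shape (96 layers, d_model = 12288) and 16-bit storage of the LoRA weights; the exact accounting in the paper may differ slightly.

```python
# Rough parameter count for LoRA on GPT-3 175B with r = 4, adapting Wq and Wv only.
n_layers, d_model, r = 96, 12288, 4
adapted_matrices_per_layer = 2                                   # Wq and Wv
lora_params = n_layers * adapted_matrices_per_layer * 2 * d_model * r  # A (r x d) + B (d x r)
print(lora_params)                    # ~18.9M trainable parameters vs ~175B in the base model
print(lora_params * 2 / 1e6, "MB")    # ~38 MB at 2 bytes/param, consistent with the ~35MB / 10,000x figure
```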
Given the empirical advantage of LoRA, we hope to further explain the properties of the low-rank adaptation learned from downstream tasks. Note that the low-rank structure not only lowers the hardware barrier to entry, which allows us to run multiple experiments in parallel, but also gives better interpretability of how the update weights are correlated with the pre-trained weights.
This suggests that even a rank of four captures enough information in ∆W such that it is preferable to adapt more weight matrices than to adapt a single type of weight with a larger rank.
Table 6 shows that, surprisingly, LoRA already performs competitively with a very small r (more so for {Wq, Wv} than just Wq). This suggests the update matrix ∆W could have a very small “intrinsic rank”. To further support this finding, we check the overlap of the subspaces learned by different choices of r and by different random seeds. We argue that increasing r does not cover a more meaningful subspace, which suggests that a low-rank adaptation matrix is sufficient.
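One way to quantify the subspace overlap mentioned here is to compare the top singular directions of the A matrices learned with different ranks or seeds. The sketch below computes a normalized projection overlap in that spirit; the exact normalization used in the paper's analysis is an assumption here.

```python
import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
    """Overlap between the top-i and top-j right-singular subspaces of two A matrices (r x k)."""
    # Right-singular vectors span the subspaces of the k-dimensional input space
    # that each low-rank update acts on.
    V1 = torch.linalg.svd(A1, full_matrices=False).Vh[:i]   # (i, k)
    V2 = torch.linalg.svd(A2, full_matrices=False).Vh[:j]   # (j, k)
    # Normalized Frobenius overlap in [0, 1]: 1 means one subspace contains the other.
    return (torch.linalg.norm(V1 @ V2.T) ** 2 / min(i, j)).item()
```

For example, `subspace_similarity(A_r8, A_r64, i=8, j=8)` would compare the full column space of a rank-8 update with the top-8 directions of a rank-64 update; values near 1 indicate strong overlap.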
We draw several conclusions from Table 7. First, ∆W has a stronger correlation with W compared to a random matrix, indicating that ∆W amplifies some features that are already in W. Second, instead of repeating the top singular directions of W, ∆W only amplifies directions that are not emphasized in W. Third, the amplification factor is rather large: 21.5 ≈ 6.91/0.32 for r = 4. See Section H.4 for why r = 64 has a smaller amplification factor. We also provide a visualization in Section H.3 for how the correlation changes as we include more top singular directions from Wq. This suggests that the low-rank adaptation matrix potentially amplifies important features for specific downstream tasks that were learned but not emphasized in the general pre-training model.
Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switching cost of hosting independent instances for different tasks. We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality. Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters. While we focused on Transformer language models, the proposed principles are generally applicable to any neural network with dense layers.