Low-Rank Adaptation (LoRA)
Fine-tuning large language models has become increasingly challenging due to the vast number of parameters involved and the computational resources required.
As models grow in size, traditional fine-tuning methods become impractical and costly.
Low-Rank Adaptation (LoRA) addresses this problem by efficiently adapting pre-trained language models to downstream tasks while significantly reducing the number of trainable parameters.
This document breaks down the famous October 2021 paper that describes the technique.
The Intrinsic Rank Hypothesis that underpins LoRA
In a 2020 paper titled "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning", a team at Facebook found that pre-trained language models can still learn efficiently even when their parameters are randomly projected onto a much smaller subspace.
The intrinsic rank hypothesis is a crucial concept that underlies the development of LoRA.
The hypothesis suggests that the model's parameters lie in a lower-dimensional subspace than the full parameter space, referred to as the "intrinsic dimension" of the model.
The intrinsic dimension represents the minimum number of dimensions required to accurately capture the important information in the model's parameter space.
The intrinsic rank hypothesis extends this idea to weight updates that occur during fine-tuning of language models.
It posits that the updates to the weights also have a low "intrinsic rank", meaning they can be well-approximated by a low-rank matrix.
This hypothesis is significant because it implies that the necessary adjustments to the model during fine-tuning can be captured using a lower-dimensional representation, rather than updating the entire weight matrix.
By exploiting the intrinsic low rank of the weight updates, LoRA can significantly reduce the number of trainable parameters while still allowing for effective adaptation to downstream tasks.
This not only makes fine-tuning more computationally efficient but also enables the adaptation of large language models on resource-constrained devices.
The intrinsic rank hypothesis is a fundamental principle that guides the design and implementation of LoRA.
Weight Matrices in Transformers
In the Transformer architecture, the self-attention layer is a key component that allows the model to attend to different positions of the input sequence.
The self-attention mechanism is applied to the input embeddings or the output of the previous layer, which we'll denote as $X$, with shape $(n, d_{model})$, where $n$ is the sequence length and $d_{model}$ is the model dimension.
The self-attention layer consists of multiple attention heads that operate in parallel.
Each attention head performs the following steps:
Linearly project the input into query, key, and value representations using the corresponding weight matrices ($W_q$, $W_k$, $W_v$).
Compute the attention scores by taking the dot product of the query and key representations.
Scale the attention scores and apply a softmax function to obtain the attention weights.
Multiply the attention weights with the value representations to get the weighted values.
Concatenate the weighted values from all attention heads and linearly project them using the output projection matrix $W_o$.
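To make these steps concrete, here is a minimal single-head sketch in PyTorch (multi-head splitting and concatenation are omitted for brevity; the dimension names are illustrative):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention, showing the four weight matrices."""

    def __init__(self, d_model: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)  # query projection
        self.W_k = nn.Linear(d_model, d_model, bias=False)  # key projection
        self.W_v = nn.Linear(d_model, d_model, bias=False)  # value projection
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # output projection
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)  # scaled dot products
        weights = scores.softmax(dim=-1)                            # attention weights
        return self.W_o(weights @ v)                                # weighted values, projected

out = SelfAttention(64)(torch.randn(1, 10, 64))  # shape (1, 10, 64)
```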
In the Transformer architecture, there are four weight matrices in the self-attention module: the query, key, value, and output projection matrices.
Now, let's focus on the dimensions of the weight matrices (treating all heads jointly, as the LoRA paper does):
Query matrix $W_q$: $d_{model} \times d_{model}$
Key matrix $W_k$: $d_{model} \times d_{model}$
Value matrix $W_v$: $d_{model} \times d_{model}$
Output projection matrix $W_o$: $d_{model} \times d_{model}$
Weight matrices enable the Transformer model to learn and capture complex relationships and dependencies within the input sequence. The learned weights in these matrices are crucial for the model's ability to attend to relevant information and generate meaningful representations.
In summary, the self-attention mechanism in the Transformer architecture relies on four key weight matrices: query, key, value, and output projection matrices.
These matrices linearly project the input into different subspaces, enabling the model to capture dependencies and learn meaningful representations. The dimensions of these matrices are typically determined by the model's hyperparameters, such as the dimensionality of the model and the number of attention heads.
Understanding the roles and dimensions of these weight matrices is crucial for grasping how LoRA modifies them, which we will explore in the next section.
How does LoRA work?
LoRA (Low-Rank Adaptation) introduces a modification to the weight matrices to efficiently adapt the pre-trained model to downstream tasks.
Let's see how LoRA gets involved in the process and influences the weights.
Recall the self-attention mechanism has four weight matrices:
Query matrix $W_q$
Key matrix $W_k$
Value matrix $W_v$
Output projection matrix $W_o$
These matrices are typically learned during the pre-training phase and have full rank.
In linear algebra, a matrix is said to have full rank if its rank is equal to the smaller of its number of rows or columns.
In the context of neural networks, this means that the weight matrices in dense layers are typically not low-rank - they cannot be exactly represented as a product of two smaller matrices.
LoRA modifies the weight matrices by introducing a low-rank decomposition of the weight updates.
Instead of directly updating the pre-trained weight matrices, LoRA represents the weight updates using two smaller matrices, $A$ and $B$, such that:

$$W = W_0 + \Delta W = W_0 + BA$$

where:
$W_0$ is the original pre-trained weight matrix ($d \times d$, with $d$ the model dimension)
$\Delta W$ is the weight update matrix
$B$ is a matrix of size $d \times r$, where $r$ is the rank of the decomposition
$A$ is a matrix of size $r \times d$
The key idea behind LoRA is to use a low-rank decomposition of the weight updates. By representing the weight updates as the product of two smaller matrices, A and B, LoRA allows for a more compact and efficient representation.
The rank (r) of the decomposition determines the size of these matrices and controls the expressiveness of the adaptation.
A lower rank results in fewer trainable parameters, making the adaptation more memory-efficient, while a higher rank allows for more flexibility in adapting the weights.
During the fine-tuning process with LoRA:
The method involves freezing the original model weights and adjusting only the two smaller matrices, $A$ and $B$.
The pre-trained weight matrices remain frozen and do not receive gradient updates.
The matrices $A$ and $B$ are the only trainable parameters. Matrix $A$ is initialised with a random Gaussian distribution, while matrix $B$ is initialised with zeros, so the update $\Delta W = BA$ starts at zero.
In the forward pass, the input is multiplied by both the pre-trained weight matrix $W_0$ and the LoRA weight update matrix $\Delta W = BA$. The results are then summed element-wise to obtain the updated output: $h = W_0 x + BAx$.
During backpropagation, gradients are computed only for the $A$ and $B$ matrices, while the pre-trained weight matrices remain unchanged.
The optimisation process updates the $A$ and $B$ matrices based on the gradients, allowing the model to adapt to the downstream task.
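Putting these steps together, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch (an illustration, not the official implementation; the paper additionally scales the update by $\alpha / r$, which is omitted here for clarity):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update BA."""

    def __init__(self, pretrained: nn.Linear, r: int):
        super().__init__()
        d_out, d_in = pretrained.weight.shape
        self.pretrained = pretrained
        self.pretrained.weight.requires_grad = False        # W_0 is frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Gaussian initialisation
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero initialisation, so BA = 0 at the start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + B A x: the frozen path plus the low-rank update, summed
        return self.pretrained(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(64, 64, bias=False), r=4)
h = layer(torch.randn(1, 10, 64))  # only layer.A and layer.B receive gradients
```

After training, $BA$ can be folded into the frozen weight, recovering a plain linear layer for deployment (more on this below).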
By representing the weight updates using the low-rank decomposition $\Delta W = BA$, LoRA significantly reduces the number of trainable parameters.
The rank $r$ determines the size of the matrices $A$ and $B$ and controls the expressiveness of the adaptation.
A smaller rank $r$ results in fewer trainable parameters and more efficient adaptation, while a larger rank $r$ allows for more flexibility in adapting the weights.
What are the low-rank matrices?
In the Low-Rank Adaptation (LoRA) method proposed in the paper, $A$ and $B$ refer to the low-rank matrices used to approximate the weight update matrix $\Delta W$ during adaptation.
As discussed, the weight matrix being targeted is part of the Transformer architecture, specifically the weight matrices in the self-attention module.
Matrices $A$ and $B$
As highlighted, the authors of LoRA hypothesised that during fine-tuning, the updates to the weights have a low "intrinsic rank", meaning they can be well-approximated by a low-rank matrix.
This means that significant changes to the neural network can be captured using a lower-dimensional representation.
Essentially, the idea is that not all elements of $\Delta W$ are equally important; instead, a smaller subset of these changes can effectively encapsulate the necessary adjustments.
Building on this hypothesis, LoRA proposes representing $\Delta W$ as the product of two smaller matrices, $B$ and $A$, with a lower rank.
Here, $\Delta W$ denotes the change to the initial weights learned during training.
The updated weight matrix thus becomes:

$$W = W_0 + BA$$
In this equation, $W_0$ remains frozen (i.e., it is not updated during training).
The matrices $B$ and $A$ are of lower dimensionality, with their product representing a low-rank approximation of $\Delta W$.
Impact of Lower Rank on Trainable Parameters
By choosing matrices $A$ and $B$ to have a lower rank $r$, the number of trainable parameters is significantly reduced.
For example, if $W_0$ is a $d \times d$ matrix, traditionally, updating it would involve $d^2$ parameters.
However, with $B$ and $A$ of sizes $d \times r$ and $r \times d$ respectively, the total number of parameters reduces to $2dr$, which is much smaller when $r \ll d$.
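As a quick sanity check, with an illustrative model dimension of $d = 4096$ and rank $r = 8$ (these particular numbers are my own example, not from the paper):

```python
d, r = 4096, 8

full = d * d      # parameters in a full d x d weight update: 16,777,216
lora = 2 * d * r  # parameters in B (d x r) plus A (r x d): 65,536

print(full, lora, f"{lora / full:.4%}")  # the LoRA update is ~0.39% of the full update
```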
The reduction in the number of trainable parameters, as achieved through the Low-Rank Adaptation (LoRA) method, offers several significant benefits, particularly when fine-tuning large-scale neural networks; we return to these benefits in "Why LoRA is Better!" below.
What does 'rank' mean?
The term "rank" in the context of LoRA refers to the rank of the weight update matrix $\Delta W$, which is approximated by the product of the two smaller matrices $B$ and $A$.
The idea here is to exploit the fact that matrices can contain “duplicate information” in the form of linear dependence.
This means we can use factorisation to represent a large matrix in terms of two smaller matrices.
Just as a large number can be represented as the product of two smaller numbers, a large matrix can be thought of as the product of two smaller matrices.
The rank of a matrix is the maximum number of linearly independent rows or columns in the matrix.
In other words, it is the dimension of the vector space spanned by the rows or columns of the matrix.
Example: Consider the following matrix $M$:

$$M = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 5 & 7 & 9 \end{pmatrix}$$
The rows of this matrix are not linearly independent because the third row is a linear combination of the first two rows ($R_3 = R_1 + R_2$).
Similarly, the columns are not linearly independent because the third column is a linear combination of the first two columns ($C_3 = 2C_2 - C_1$).
To find the rank of the matrix, we can use Gaussian elimination to convert the matrix into row echelon form.
Row echelon form is a type of matrix in which:
All nonzero rows are above any rows of all zeros.
The leading entry of each nonzero row after the first occurs to the right of the leading entry of the previous row.
The leading entry in any nonzero row is 1.
All entries in the column below a leading 1 are zeros.
Using Gaussian elimination, a matrix is transformed into row echelon form. This helps determine properties such as the rank and solve linear systems.
Gaussian elimination is a method used to solve systems of linear equations, find the rank r, and calculate the determinant of a matrix.
The process involves three main steps:
Forward Elimination: Transform the matrix into an upper triangular form.
Pivoting: Swap rows to position the highest absolute value as the pivot element to avoid division by zero and improve numerical stability.
Back Substitution: Solve for the variables starting from the last row upwards.
By converting the matrix into row echelon form, Gaussian elimination simplifies the system, making it easier to understand its properties and solutions.
After performing Gaussian elimination, we get matrix $M$ in row echelon form:

$$\begin{pmatrix} 1 & 2 & 3 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}$$
The number of non-zero rows in the row echelon form is the rank of the matrix.
In this case, the rank is 2, which is the maximum number of linearly independent rows or columns in the matrix M.
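You can verify this rank with NumPy, which computes it numerically (via the SVD) rather than by hand:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [5, 7, 9]])

print(np.linalg.matrix_rank(M))  # prints 2
```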
In the context of LoRA, the rank $r$ determines the dimensionality of the subspace in which the weight update matrix $\Delta W$ is approximated.
By choosing a lower rank r, we enforce a low-dimensional structure on the weight update matrix, reducing the number of trainable parameters while still capturing the most important aspects of the adaptation.
So to sum up, in LoRA, the rank r is the hyperparameter that determines the size of the matrices A and B.
Specifically:
Matrix $A$ has shape $r \times d$, where $r$ is the rank of the decomposition and $d$ is the dimension of the model.
Matrix $B$ has dimensions $d \times r$.
So, the updated equation is:

$$W = W_0 + \Delta W = W_0 + BA$$

where:
$W_0$ is the original pre-trained weight matrix
$\Delta W$ is the weight update matrix
$B$ is a matrix of size $d \times r$
$A$ is a matrix of size $r \times d$
The product of $B$ and $A$ results in a matrix of shape $d \times d$, which has rank at most $r$.
By choosing a smaller value for $r$, we enforce a low-rank structure on the weight update matrix $\Delta W$.
Expressiveness and Rank
The rank $r$ of the weight update matrix $\Delta W$ controls the expressiveness of the adaptation.
A higher rank allows for more flexibility in adapting the weights, as it can capture more complex patterns and transformations. However, increasing the rank also increases the number of trainable parameters.
On the other hand, a lower rank restricts the expressiveness of the adaptation but results in fewer trainable parameters.
This is because the matrices $A$ and $B$ have fewer elements when $r$ is smaller.
A low-rank approximation of the weight update matrix can still capture the most important aspects of the adaptation while being more parameter-efficient.
The choice of rank r depends on the complexity of the downstream task and the available computational resources. It is a trade-off between expressiveness and efficiency.
The consensus is that when the fine-tuning data is similar to the data used in pre-training, a low rank $r$ is probably sufficient. When fine-tuning on very new tasks, which might require substantial logical changes within the model, a higher rank $r$ may work better.
Applying LoRA to Transformer Self-Attention Weights
LoRA can be applied to any or all of the self-attention weight matrices in each Transformer layer.
The paper primarily focuses on adapting only the query and value matrices because they play the most critical role in capturing and transforming the input representations.
During the forward pass, the adapted weight matrix is computed as $W = W_0 + BA$, where $W_0$ is the pre-trained weight matrix and $BA$ is the low-rank adaptation term.
This allows for efficient adaptation without introducing additional inference latency, as the low-rank matrices can be merged with the pre-trained weights during deployment.
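A minimal sketch of that merge (shapes and variable names are illustrative):

```python
import torch

d, r = 64, 4
W0 = torch.randn(d, d)  # frozen pre-trained weights
A = torch.randn(r, d)   # trained LoRA factors
B = torch.randn(d, r)

W_merged = W0 + B @ A   # fold the low-rank update into the weights once, offline

x = torch.randn(10, d)
# The merged weights give the same output as the two-path computation,
# so deployment needs a single matmul and adds no extra latency.
assert torch.allclose(x @ W_merged.T, x @ W0.T + x @ A.T @ B.T, atol=1e-4)
```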
The key idea behind LoRA is to exploit the low-rank structure of the adaptation matrix $\Delta W$.
By approximating $\Delta W$ with the low-rank matrices $B$ and $A$, LoRA significantly reduces the number of trainable parameters while still allowing for effective adaptation to downstream tasks.
This makes it suitable for adapting large pre-trained language models, where standard fine-tuning would be prohibitively expensive in terms of storage and computation.
The problem with full fine-tuning
To remind us of the problem LoRA is solving, below is a brief description of the language modeling problem for full fine-tuning: in particular, the maximisation of conditional probabilities given a task-specific prompt.
$P_\Phi(y \mid x)$ represents a pre-trained autoregressive language model, where $\Phi$ denotes the model parameters.
$\mathcal{Z} = \{(x_i, y_i)\}_{i=1,\dots,N}$ is a dataset of context-target pairs for a downstream task, where $x_i$ is the context and $y_i$ is the target sequence.
$\Phi_0$ represents the initial pre-trained weights of the model.
$\Delta\Phi$ represents the task-specific parameter increment learned during fine-tuning.
$\Theta$ is a smaller set of parameters used to encode $\Delta\Phi$.
Language Modeling Objective
The goal is to adapt a pre-trained language model to downstream conditional text generation tasks.
The model learns to maximise the conditional probability of the target sequence given the context.
During full fine-tuning, the objective is to maximise the sum of log probabilities of each token in the target sequence $y$, conditioned on the context $x$ and the previous tokens $y_{<t}$.
This is achieved by initialising the model at $\Phi_0$ and repeatedly following the gradient to update it to $\Phi_0 + \Delta\Phi$:

$$\max_\Phi \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_\Phi(y_t \mid x, y_{<t})\right)$$
The notation $(x, y) \in \mathcal{Z}$ indicates that the summation is performed over all context-target pairs in the dataset, and $|y|$ denotes the length of the target sequence $y$.
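In code, this objective is the familiar next-token cross-entropy restricted to the target tokens. A schematic sketch (here `model` stands for any autoregressive LM that maps token ids to logits; the function name is my own):

```python
import torch
import torch.nn.functional as F

def target_log_prob(model, context_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Sum of log P(y_t | x, y_<t) over the target tokens of one (context, target) pair."""
    ids = torch.cat([context_ids, target_ids])   # the full sequence: x followed by y
    logits = model(ids.unsqueeze(0)).squeeze(0)  # shape (seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    total = torch.tensor(0.0)
    for t in range(target_ids.numel()):
        pos = context_ids.numel() + t - 1        # logits at position pos predict token pos + 1
        total = total + log_probs[pos, target_ids[t]]
    return total
```

Full fine-tuning maximises the sum of this quantity over the whole dataset by updating every parameter in $\Phi$; the parameter-efficient approach below maximises the same objective while training far fewer parameters.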
Parameter-Efficient Approach
The main drawback of full fine-tuning is that for each downstream task, a separate set of parameters $\Delta\Phi$ is learned, which has the same dimension as the pre-trained weights $\Phi_0$.
This can be challenging to store and deploy, especially for large models.
To address this, the authors propose a parameter-efficient approach where the task-specific parameter increment $\Delta\Phi$ is encoded by a much smaller set of parameters $\Theta$, such that $|\Theta| \ll |\Phi_0|$.
The objective becomes:

$$\max_\Theta \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\right)$$
Here, $\Delta\Phi$ is a function of $\Theta$, denoted as $\Delta\Phi(\Theta)$.
The goal is to find the optimal $\Theta$ that maximises the conditional language modeling objective.
Low-Rank Representation
The authors propose to use a low-rank representation to encode $\Delta\Phi$, which is both compute- and memory-efficient.
This means that $\Delta\Phi$ is represented as a product of smaller matrices, reducing the number of parameters needed to store and update during fine-tuning.
The key idea is to significantly reduce the number of trainable parameters $|\Theta|$ compared to the size of the pre-trained weights $|\Phi_0|$.
For example, when using the GPT-3 175 billion parameter model as the pre-trained model, the number of trainable parameters $|\Theta|$ can be as small as 0.01% of $|\Phi_0|$, greatly reducing the storage and computational requirements for fine-tuning on downstream tasks.
Issues with existing solutions
Other Parameter-Efficient Fine-Tuning (PEFT) solutions for efficient model adaptation in transfer learning, such as adding adapter layers or optimising input-layer activations, have limitations, especially in large-scale and latency-sensitive production scenarios.
We discuss these below.
Adapter Layers
Adapter layers, as proposed by Houlsby et al. (2019) and Lin et al. (2020), are additional layers inserted into the Transformer architecture to enable parameter-efficient fine-tuning.
While adapters have fewer parameters compared to the original model, they introduce extra computation that must be processed sequentially, leading to increased inference latency.
The latency issue becomes more prominent in online inference settings where the batch size is small (for example a batch size of 1).
Even with a small bottleneck dimension in the adapter layers, the added latency can be significant (e.g., 20-30% slower for GPT-2 medium on a single GPU). This problem is amplified when using model parallelism techniques, as the additional depth requires more synchronous GPU operations.
Optimising Input Layer Activations (Prefix Tuning)
Another PEFT approach is prefix tuning (Li & Liang, 2021), which directly optimises a portion of the input layer activations (the prompt) while keeping the pre-trained model unchanged.
However, this method faces optimisation challenges and exhibits non-monotonic performance changes with respect to the number of trainable parameters.
Non-Monotonic Performance Changes in Prefix Tuning
Non-monotonic performance changes refer to fluctuations in model performance that do not consistently improve or degrade as the number of trainable parameters increases.
In the context of prefix tuning, this means that increasing the number of trainable parameters does not guarantee a corresponding increase in model performance.
Instead, performance may improve up to a point, then degrade, and potentially improve again, creating a non-linear relationship. These irregular performance patterns can complicate the optimisation process and make it challenging to determine the optimal parameter configuration for prefix tuning.
LoRA in Practice
What subset of weight matrices should be adapted for maximum downstream performance?
The authors experimented with applying LoRA to different subsets of the self-attention weight matrices when adapting GPT-3, with a budget of 18 million trainable parameters (compare this with GPT-3's 175 billion parameters!). The weight matrices considered are $W_q$, $W_k$, $W_v$, and $W_o$.
They found that adapting both the query weights $W_q$ and value weights $W_v$ yielded the best performance on downstream tasks like WikiSQL and MultiNLI.
Adapting $W_q$, $W_k$, $W_v$, and $W_o$ together also performed well, but adapting only $W_q$ or $W_k$ resulted in significantly lower performance.
This suggests that even with a low rank (e.g., r=4), adapting multiple weight matrices captures more useful information than adapting a single weight matrix with a higher rank.
What is the optimal rank for the adaptation matrix $\Delta W$?
The authors investigated the effect of the LoRA rank r on downstream performance.
Surprisingly, they found that a rank as low as $r=1$ was sufficient for adapting both $W_q$ and $W_v$ on the datasets they tested, while adapting $W_q$ alone required a larger rank $r$.
They further analysed the subspace similarity between the learned adaptation matrices with different ranks (r=8 and r=64) and found that the top singular vector directions overlapped significantly, while the other directions did not.
This suggests that the additional directions learned with higher ranks might contain mostly random noise.
The authors conclude that the optimal adaptation matrix ∆W can indeed have a very low intrinsic rank, although they note that this may not hold for every task or dataset.
Connection between ∆W and W
To investigate the relationship between the adaptation matrix $\Delta W$ and the pre-trained weight matrix $W$, the authors projected $W$ onto the $r$-dimensional subspace of $\Delta W$ and compared the Frobenius norms of the result.
They found that $\Delta W$ has a stronger correlation with $W$ than a random matrix does, indicating that $\Delta W$ amplifies some features that are already present in $W$.
However, instead of amplifying the top singular directions of $W$, $\Delta W$ emphasises directions that are not as prominent in $W$.
The amplification factor is quite large (e.g., 21.5 for r=4 in the 48th layer of GPT-3).
This suggests that the low-rank adaptation matrix amplifies important features for specific downstream tasks that were learned but not emphasised in the general pre-training model.
Process for determining the optimal rank r for LoRA when fine-tuning
Start with a low rank r (e.g., r=1 or r=2) and fine-tune the model on the downstream task.
Gradually increase the rank r (e.g., r=4, r=8) and compare the performance on a validation set.
If increasing the rank leads to significant improvements, continue increasing rank r until the performance gains plateau or the computational cost becomes too high.
If the performance is already good with a low rank, try adapting additional weight matrices (e.g., $W_q$ and $W_v$ together) with the same low rank.
Compare the performance and computational cost of different combinations of rank and adapted weight matrices to find the optimal configuration for the specific downstream task and resource constraints.
Keep in mind that the optimal rank may vary depending on the complexity of the downstream task, the size of the dataset, and the similarity between the pre-training and downstream domains.
It's also important to consider the trade-off between performance and computational efficiency when choosing the rank.
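Schematically, the search might look like this (a sketch; `train_and_eval` is a hypothetical helper that fine-tunes with LoRA at the given rank and returns a validation score):

```python
best = None
for r in [1, 2, 4, 8, 16]:
    score = train_and_eval(rank=r)  # hypothetical: LoRA fine-tune at rank r, evaluate on the validation set
    print(f"r={r}: validation score {score:.4f}")
    if best is not None and score - best < 0.001:
        break                       # gains have plateaued; stop increasing r
    best = score
```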
Contents of A and B
The matrices $A$ and $B$ contain learned parameters that are updated during the fine-tuning process. They are initialised at the beginning of training as follows:
Matrix $A$ is initialised with a random Gaussian distribution.
Matrix $B$ is initialised with zeros.
During training, the values in $A$ and $B$ are updated based on the gradients computed during backpropagation.
These matrices learn to adapt the pre-trained weights to the specific downstream task by capturing the important patterns and transformations needed for the adaptation.
The content of $A$ and $B$ is learned through the optimisation process and depends on the specific task and dataset being fine-tuned on.
The learned values in these matrices represent the low-rank adaptation that modifies the pre-trained weights to better suit the downstream task.
Why LoRA is Better!
LoRA addresses the limitations of existing solutions by introducing a more efficient and effective approach to model adaptation:
No Inference Latency: Unlike adapter layers, LoRA does not introduce additional depth to the model. The low-rank adaptation matrices can be merged with the pre-trained weights after fine-tuning, resulting in no extra inference latency compared to a fully fine-tuned model.
Compute and Memory Efficiency: LoRA uses a low-rank representation to encode the task-specific parameter increments, significantly reducing the number of trainable parameters. This makes fine-tuning more compute- and memory-efficient, especially for large models.
Optimisation Stability: Compared to prompt tuning, LoRA optimises the model parameters directly, which leads to more stable optimisation and monotonic performance improvements with increased model capacity.
Sequence Length Preservation: LoRA does not require reserving a portion of the sequence length for adaptation, allowing the full sequence length to be used for downstream tasks, potentially leading to better performance.
Flexibility and Composability: LoRA is agnostic to the training objective and can be easily integrated with existing models and architectures. It is also composable with other adaptation techniques, such as prefix tuning, offering further flexibility.
Enhanced Compatibility: Works well alongside other fine-tuning techniques like adapters and prefix tuning.
Conclusion
LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that takes advantage of the low intrinsic rank of the weight updates.
Instead of updating the entire weight matrix during fine-tuning, LoRA decomposes the weight update matrix $\Delta W$ into two smaller matrices ($A$ and $B$) with a lower rank $r$.
This significantly reduces the number of trainable parameters and memory requirements while maintaining performance comparable to full fine-tuning.
Key Insights
Swappable LoRA Modules
One of the most significant advantages of LoRA is the ability to swap different fine-tuned LoRA modules for various tasks on a single base model.
This allows for a more flexible and efficient deployment of models, as you can easily switch between tasks without needing separate fully fine-tuned models for each task.
Inference Time Swapping
The swappable nature of LoRA modules can be used even at inference time.
This means that customers can choose which task they want the model to perform on-the-fly, without requiring multiple models to be running simultaneously. This is a powerful feature that sets LoRA apart from other adaptation methods.
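A conceptual sketch of this pattern (hypothetical names, not a specific library's API):

```python
import torch

class LoRARegistry:
    """One frozen base weight plus per-task (A, B) pairs; swap tasks by merging on demand."""

    def __init__(self, W0: torch.Tensor):
        self.W0 = W0        # shared frozen base weights
        self.adapters = {}  # task name -> (A, B)

    def register(self, task: str, A: torch.Tensor, B: torch.Tensor):
        self.adapters[task] = (A, B)

    def weight_for(self, task: str) -> torch.Tensor:
        A, B = self.adapters[task]
        return self.W0 + B @ A  # task-specific weights built from the single base model

d, r = 64, 4
registry = LoRARegistry(torch.randn(d, d))
registry.register("summarise", torch.randn(r, d), torch.randn(d, r))
registry.register("translate", torch.randn(r, d), torch.randn(d, r))
W_task = registry.weight_for("translate")  # switch tasks without reloading the base model
```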
Potential for Further Optimisation
While LoRA is often applied to the attention weights (specifically query and value matrices) in transformer models, the technique could potentially be applied to other weight matrices in the model.
Exploring the application of LoRA to different components of the model architecture could lead to further optimisations and improvements.
Balancing Rank and Performance
The rank of the low-rank matrices (A and B) in LoRA is a crucial hyperparameter that determines the trade-off between model performance and efficiency. While lower ranks lead to greater memory savings and faster training, it's essential to find the right balance for each specific task. Experimenting with different rank values and evaluating the results on a validation set can help determine the optimal configuration.
Implications for Model Accessibility
By significantly reducing the memory requirements and training costs, LoRA makes fine-tuning large language models more accessible to a wider range of researchers and practitioners.
This could accelerate the development and deployment of specialised models for various tasks and domains.
Handling large datasets
Fine-tuning with LoRA on large datasets may still require significant computational resources, even though the number of trainable parameters is reduced. Strategies such as data parallelism, model parallelism, or gradient accumulation can be employed to handle large datasets efficiently.