Low-Rank Adaptation (LoRA)
Fine-tuning large language models has become increasingly challenging due to the vast number of parameters involved and the computational resources required.
As models grow in size, traditional fine-tuning methods become impractical and costly.
Low-Rank Adaptation (LoRA) is a solution to this problem by efficiently adapting pre-trained language models to downstream tasks while significantly reducing the number of trainable parameters.
This document breaks down the famous October 2021 paper, "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al.), which introduced the technique.
The Intrinsic Rank Hypothesis that underpins LoRA
In a 2020 paper from the team at Facebook, "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning", it was found that pre-trained language models can still be fine-tuned effectively even when the optimisation is restricted to a randomly chosen low-dimensional subspace of the full parameter space.
The intrinsic rank hypothesis is a crucial concept that underlies the development of LoRA.
The hypothesis suggests that the model's parameters effectively lie in a lower-dimensional subspace of the full parameter space; the dimension of this subspace is referred to as the "intrinsic dimension" of the model.
The intrinsic dimension represents the minimum number of dimensions required to accurately capture the important information in the model's parameter space.
The intrinsic rank hypothesis extends this idea to weight updates that occur during fine-tuning of language models.
It posits that the updates to the weights also have a low "intrinsic rank", meaning they can be well-approximated by a low-rank matrix.
This hypothesis is significant because it implies that the necessary adjustments to the model during fine-tuning can be captured using a lower-dimensional representation, rather than updating the entire weight matrix.
By exploiting the intrinsic low rank of the weight updates, LoRA can significantly reduce the number of trainable parameters while still allowing for effective adaptation to downstream tasks.
This not only makes fine-tuning more computationally efficient but also enables the adaptation of large language models on resource-constrained devices.
The intrinsic rank hypothesis is a fundamental principle that guides the design and implementation of LoRA.
Weight Matrices in Transformers
In the Transformer architecture, the self-attention layer is a key component that allows the model to attend to different positions of the input sequence.
The self-attention layer consists of multiple attention heads that operate in parallel.
Each attention head performs the following steps:
Compute the attention scores by taking the dot product of the query and key representations.
Scale the attention scores and apply a softmax function to obtain the attention weights.
Multiply the attention weights with the value representations to get the weighted values.
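To make the three steps above concrete, here is a minimal sketch of scaled dot-product attention for a single head in PyTorch; the tensor shapes and the sequence length are illustrative and not tied to any particular model.

```python
import math
import torch

def single_head_attention(query, key, value):
    """Scaled dot-product attention for one head.
    query, key, value: (seq_len, d_k) tensors, already produced by multiplying
    the input with the W_q, W_k and W_v projection matrices."""
    d_k = query.size(-1)
    # 1. Attention scores: dot product of queries and keys.
    scores = query @ key.transpose(-2, -1)
    # 2. Scale and apply softmax to obtain the attention weights.
    weights = torch.softmax(scores / math.sqrt(d_k), dim=-1)
    # 3. Weighted sum of the value vectors.
    return weights @ value

# Illustrative usage: a toy sequence of length 5 with d_k = 64.
q, k, v = (torch.randn(5, 64) for _ in range(3))
out = single_head_attention(q, k, v)  # shape: (5, 64)
```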
In the Transformer architecture, there are four weight matrices in the self-attention module: the query projection Wq, the key projection Wk, the value projection Wv, and the output projection Wo.
Now, let's focus on the dimensions of the weight matrices: in the notation used by the LoRA paper, each of Wq, Wk, Wv and Wo is treated as a single matrix of size d_model × d_model, where d_model is the model's hidden dimension, with the per-head projections obtained by splitting these matrices across the attention heads.
Weight matrices enable the Transformer model to learn and capture complex relationships and dependencies within the input sequence. The learned weights in these matrices are crucial for the model's ability to attend to relevant information and generate meaningful representations.
In summary, the self-attention mechanism in the Transformer architecture relies on four key weight matrices: query, key, value, and output projection matrices.
These matrices linearly project the input into different subspaces, enabling the model to capture dependencies and learn meaningful representations. The dimensions of these matrices are typically determined by the model's hyperparameters, such as the dimensionality of the model and the number of attention heads.
Understanding the roles and dimensions of these weight matrices is crucial for grasping the concepts behind LoRA and the Intrinsic Rank Hypothesis, which we will explore in the next section.
How does LoRA work?
LoRA (Low-Rank Adaptation) introduces a modification to the weight matrices to efficiently adapt the pre-trained model to downstream tasks.
Let's see how LoRA gets involved in the process and influences the weights.
Recall that the self-attention mechanism has four weight matrices: Wq, Wk, Wv and Wo.
These matrices are typically learned during the pre-training phase and have full rank.
In linear algebra, a matrix is said to have full rank if its rank is equal to the smaller of its number of rows or columns.
In the context of neural networks, this means that the weight matrices in dense layers are typically not low-rank: they cannot be exactly represented as the product of two much smaller matrices.
LoRA modifies the weight matrices by introducing a low-rank decomposition of the weight updates, writing the adapted weight as W = W0 + ∆W = W0 + BA,
where:
W0 is the frozen pre-trained weight matrix of size d × k,
B is a trainable matrix of size d × r,
A is a trainable matrix of size r × k,
and the rank r << min(d, k).
The key idea behind LoRA is to use a low-rank decomposition of the weight updates. By representing the weight updates as the product of two smaller matrices, A and B, LoRA allows for a more compact and efficient representation.
The rank (r) of the decomposition determines the size of these matrices and controls the expressiveness of the adaptation.
A lower rank results in fewer trainable parameters, making the adaptation more memory-efficient, while a higher rank allows for more flexibility in adapting the weights.
During the fine-tuning process with LoRA, the pre-trained weights W0 are kept frozen and only the low-rank matrices A and B are updated.
As noted above, a smaller rank r results in fewer trainable parameters and more efficient adaptation, while a larger rank r allows for more flexibility in adapting the weights.
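As a rough sketch of what this looks like in code, the snippet below wraps a frozen linear layer with a trainable low-rank update. It is not the reference implementation from the paper: the class name is made up, and while the scaling by alpha/r and the initialisation (A Gaussian, B zeros) follow the scheme described in the paper, the specific values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update:
    h = x @ W0^T + (alpha / r) * x @ A^T @ B^T, with only A and B trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weights W0 (and bias, if present).
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A starts as a small random Gaussian, B as zeros, so BA = 0 initially
        # and the layer behaves exactly like the pre-trained one at step zero.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

# Illustrative usage: wrap a 512 x 512 projection; only A and B get gradients.
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
out = layer(torch.randn(4, 512))
```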
What are the low rank matrices?
Matrices A and B are the two low-rank factors: their product BA approximates the weight update ∆W, and together they contain far fewer entries than the full weight matrix they adapt.
This means that significant changes to the neural network can be captured using a lower-dimensional representation.
Impact of Lower Rank on Trainable Parameters
The reduction in the number of trainable parameters achieved by LoRA offers several significant benefits when fine-tuning large-scale neural networks: far less memory is needed for gradients and optimiser states, task-specific checkpoints are tiny compared to the full model, and switching between tasks only requires swapping the small A and B matrices.
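To put rough numbers on this, the sketch below counts the trainable parameters for a single d × d weight matrix updated in full versus with a rank-r decomposition; d = 4096 and r = 8 are illustrative values, not figures from the paper.

```python
# Illustrative parameter count for a single d x d weight matrix.
d, r = 4096, 8

full_update = d * d          # updating every entry of the matrix: 16,777,216
lora_update = r * d + d * r  # A (r x d) plus B (d x r): 65,536

print(full_update, lora_update, full_update // lora_update)  # ratio: 256x fewer
```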
What does 'rank' mean?
The idea here is to exploit the fact that matrices can contain “duplicate information” in the form of linear dependence.
This means we can use factorisation to represent a large matrix in terms of two smaller matrices.
Just as a large number can be represented as the product of two smaller numbers, a matrix can be thought of as the product of two smaller matrices.
The rank of a matrix is the maximum number of linearly independent rows or columns in the matrix.
In other words, it is the dimension of the vector space spanned by the rows or columns of the matrix.
Example: Consider the following matrix M:
To find the rank of the matrix, we can use Gaussian elimination to convert the matrix into row echelon form.
Row echelon form is a type of matrix in which:
All nonzero rows are above any rows of all zeros.
The leading entry of each nonzero row after the first occurs to the right of the leading entry of the previous row.
The leading entry in any nonzero row is 1.
All entries in the column below a leading 1 are zeros.
Using Gaussian elimination, a matrix is transformed into row echelon form. This helps determine properties such as the rank and solve linear systems.
Gaussian elimination is a method used to solve systems of linear equations, find the rank of a matrix, and calculate its determinant.
The process involves three main steps:
Forward Elimination: Transform the matrix into an upper triangular form.
Pivoting: Swap rows to position the highest absolute value as the pivot element to avoid division by zero and improve numerical stability.
Back Substitution: Solve for the variables starting from the last row upwards.
By converting the matrix into row echelon form, Gaussian elimination simplifies the system, making it easier to understand its properties and solutions.
After performing Gaussian elimination, we obtain the row echelon form of M:
The number of non-zero rows in the row echelon form is the rank of the matrix.
In this case, the rank is 2, which is the maximum number of linearly independent rows or columns in the matrix M.
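The same check can be done programmatically. The matrix below is an illustrative rank-2 example (its third row is the sum of the first two), not the matrix M from the worked example above.

```python
import numpy as np

# Illustrative 3 x 3 matrix: the third row is the sum of the first two,
# so only two rows are linearly independent.
M = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [5.0, 7.0, 9.0],
])

print(np.linalg.matrix_rank(M))  # 2
```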
By choosing a lower rank r, we enforce a low-dimensional structure on the weight update matrix, reducing the number of trainable parameters while still capturing the most important aspects of the adaptation.
So to sum up, in LoRA, the rank r is the hyperparameter that determines the size of the matrices A and B.
Specifically, A has dimensions r × k and B has dimensions d × r, where d × k is the size of the original weight matrix being adapted.
So, the updated equation is W = W0 + ∆W = W0 + BA,
where W0 is the frozen pre-trained weight, BA is the trainable low-rank update, and r << min(d, k).
Expressiveness and Rank
The rank r of the weight update matrix ∆W controls the expressiveness of the adaptation.
A higher rank allows for more flexibility in adapting the weights, as it can capture more complex patterns and transformations. However, increasing the rank also increases the number of trainable parameters.
On the other hand, a lower rank restricts the expressiveness of the adaptation but results in fewer trainable parameters.
A low-rank approximation of the weight update matrix can still capture the most important aspects of the adaptation while being more parameter-efficient.
The choice of rank r depends on the complexity of the downstream task and the available computational resources. It is a trade-off between expressiveness and efficiency.
The general guidance is that when the fine-tuning data is similar to the data used in pre-training, a low rank r is probably sufficient. When fine-tuning on very different tasks, which might require substantial changes to the model's behaviour, a higher rank r may work better.
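One way to build intuition for why a low-rank approximation can retain most of the information is the truncated singular value decomposition. The snippet below is a generic illustration, not part of the LoRA method itself: it builds a matrix that is approximately rank 4 and shows that a rank-4 approximation reconstructs it almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative "weight update": a genuinely rank-4 matrix plus a little noise.
d, true_rank = 256, 4
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_w += 0.01 * rng.normal(size=(d, d))

# Truncated SVD: keep only the top-r singular directions.
r = 4
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

rel_error = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)
print(rel_error)  # tiny: the rank-4 approximation captures almost all of delta_w
```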
Applying LoRA to Transformer Self Attention Weights
In the paper, LoRA is applied only to the query and value projection matrices (Wq and Wv) of the self-attention module, while the remaining weights stay frozen. This allows for efficient adaptation without introducing additional inference latency, as the low-rank matrices can be merged with the pre-trained weights during deployment.
This makes it suitable for adapting large pre-trained language models, where standard fine-tuning would be prohibitively expensive in terms of storage and computation.
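In practice this is usually done with an off-the-shelf library rather than by hand. The sketch below uses the Hugging Face peft library; the checkpoint and hyperparameter values are illustrative, and the target_modules names depend on the architecture being adapted (OPT happens to name its attention projections q_proj and v_proj).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative small base model; adjust target_modules to match the
# attention projection names of whichever architecture you are adapting.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank matrices A and B
    lora_alpha=16,                        # scaling applied to the update BA
    target_modules=["q_proj", "v_proj"],  # adapt only the query and value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the A and B matrices are trainable
```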
The problem with full fine-tuning
To remind us of the problem LoRA is solving, below is a brief description of the language modeling objective under full fine-tuning, in particular the maximisation of conditional probabilities given a task-specific prompt.
Language Modeling Objective
The goal is to adapt a pre-trained language model to downstream conditional text generation tasks.
The model learns to maximise the conditional probability of the target sequence given the context: for a dataset Z of context-target pairs (x, y) and model parameters Φ, full fine-tuning maximises Σ_(x,y)∈Z Σ_t log P_Φ(y_t | x, y_<t) over all of Φ.
Parameter-Efficient Approach
The main drawback of full fine-tuning is that for each downstream task, a separate set of parameters ∆Φ is learned, which has the same dimension as the pre-trained weights Φ0.
This can be challenging to store and deploy, especially for large models.
To address this, the authors propose a parameter-efficient approach where the task-specific parameter increment ∆Φ is encoded by a much smaller set of parameters Θ, such that |Θ| << |Φ0|.
The objective becomes maximising Σ_(x,y)∈Z Σ_t log P_(Φ0 + ∆Φ(Θ))(y_t | x, y_<t) over Θ.
Here, ∆Φ is a function of Θ, denoted as ∆Φ(Θ).
The goal is to find the optimal Θ that maximises the conditional language modeling objective.
Low-Rank Representation
In LoRA, the task-specific increment ∆Φ(Θ) is encoded by the low-rank decomposition described earlier, so Θ consists only of the entries of the small matrices A and B.
Issues with existing solutions
Other Parameter-Efficient Fine-Tuning (PEFT) approaches to model adaptation in transfer learning, such as adding adapter layers or optimising input layer activations, have limitations, especially in large-scale and latency-sensitive production scenarios.
We discuss these below.
Adapter Layers
Adapter layers, as proposed by Houlsby et al. (2019) and Lin et al. (2020), are additional layers inserted into the Transformer architecture to enable parameter-efficient fine-tuning.
While adapters have fewer parameters compared to the original model, they introduce extra computation that must be processed sequentially, leading to increased inference latency.
The latency issue becomes more prominent in online inference settings where the batch size is small (for example a batch size of 1).
Even with a small bottleneck dimension in the adapter layers, the added latency can be significant (e.g., 20-30% slower for GPT-2 medium on a single GPU). This problem is amplified when using model parallelism techniques, as the additional depth requires more synchronous GPU operations.
Optimising Input Layer Activations (Prefix / Prompt Tuning)
Another PEFT approach is prefix tuning (Li & Liang, 2021), which directly optimises a portion of the input layer activations (the prompt) while keeping the pre-trained model unchanged.
However, this method faces optimisation challenges and exhibits non-monotonic performance changes with respect to the number of trainable parameters.
Non-Monotonic Performance Changes in Prompt Tuning
Non-monotonic performance changes refer to fluctuations in model performance that do not consistently improve or degrade as the number of trainable parameters increases.
In the context of prompt tuning, this means that increasing the number of trainable parameters does not guarantee a corresponding increase in model performance.
Instead, performance may improve up to a point, then degrade, and potentially improve again, creating a non-linear relationship. These irregular performance patterns can complicate the optimisation process and make it challenging to determine the optimal parameter configuration for prompt tuning.
LoRA in Practice
What subset of weight matrices should be adapted for maximum downstream performance?
The authors experimented with applying LoRA to different subsets of the self-attention weight matrices when adapting GPT-3, with a budget of 18 million trainable parameters (compared to GPT-3's 175 billion parameters). They found that spreading the budget across several matrices worked best: adapting both Wq and Wv with a small rank outperformed adapting Wq alone with a larger rank.
This suggests that even with a low rank (e.g., r=4), adapting multiple weight matrices captures more useful information than adapting a single weight matrix with a higher rank.
What is the optimal rank for the adaptation matrix ∆W?
The authors investigated the effect of the LoRA rank r on downstream performance and found that surprisingly small ranks (even r=1 or r=2) were competitive with much larger ones when adapting Wq and Wv.
They further analysed the subspace similarity between the learned adaptation matrices with different ranks (r=8 and r=64) and found that the top singular vector directions overlapped significantly, while the other directions did not.
This suggests that the additional directions learned with higher ranks might contain mostly random noise.
The authors conclude that the optimal adaptation matrix ∆W can indeed have a very low intrinsic rank, although they note that this may not hold for every task or dataset.
Connection between ∆W and W
To study how ∆W relates to the pre-trained weights W, the authors project W onto the low-rank subspace spanned by ∆W (using the singular vectors of ∆W) and compare Frobenius norms. The resulting amplification factor is quite large (e.g., roughly 21.5 for r=4 in the 48th layer of GPT-3).
This suggests that the low-rank adaptation matrix amplifies important features for specific downstream tasks that were learned but not emphasised in the general pre-training model.
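As a hedged sketch of how such an amplification factor can be computed (following the paper's description of projecting W onto the singular directions of ∆W and comparing Frobenius norms), the function below takes raw weight tensors; the example inputs are random stand-ins, not weights from GPT-3.

```python
import torch

def amplification_factor(delta_w, w, r):
    """||delta_w||_F / ||U_r^T @ w @ V_r^T||_F, where U_r and V_r hold the
    top-r left/right singular vectors of delta_w (Frobenius norms)."""
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    projected = U[:, :r].T @ w @ Vh[:r, :].T
    return (torch.linalg.norm(delta_w) / torch.linalg.norm(projected)).item()

# Random stand-ins for a pre-trained weight and a rank-r LoRA update.
d, r = 512, 4
w = torch.randn(d, d)
delta_w = torch.randn(d, r) @ torch.randn(r, d)
print(amplification_factor(delta_w, w, r))
```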
Process for determining the optimal rank r for LoRA when fine-tuning
Start with a low rank r (e.g., r=1 or r=2) and fine-tune the model on the downstream task.
Gradually increase the rank r (e.g., r=4, r=8) and compare the performance on a validation set.
If increasing the rank leads to significant improvements, continue increasing rank r until the performance gains plateau or the computational cost becomes too high.
Compare the performance and computational cost of different combinations of rank and adapted weight matrices to find the optimal configuration for the specific downstream task and resource constraints.
Keep in mind that the optimal rank may vary depending on the complexity of the downstream task, the size of the dataset, and the similarity between the pre-training and downstream domains.
It's also important to consider the trade-off between performance and computational efficiency when choosing the rank.
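A bare-bones version of this search might look like the following; fine_tune_with_lora and evaluate are hypothetical helpers standing in for your own training and validation code, and the candidate ranks and 1% tolerance are arbitrary choices.

```python
# Hypothetical rank sweep: fine_tune_with_lora(), evaluate(), base_model,
# train_data and validation_data are placeholders for your own code and data.
candidate_ranks = [1, 2, 4, 8, 16]
results = {}

for r in candidate_ranks:
    adapted_model = fine_tune_with_lora(base_model, train_data, rank=r)
    results[r] = evaluate(adapted_model, validation_data)

# Pick the smallest rank whose validation score is within 1% of the best score.
best_score = max(results.values())
chosen_rank = min(r for r, score in results.items() if score >= 0.99 * best_score)
print(results, chosen_rank)
```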
Contents of A and B
Matrix A is initialised with a random Gaussian distribution, while matrix B is initialised to zeros, so that the update ∆W = BA is zero at the start of training.
These matrices learn to adapt the pre-trained weights to the specific downstream task by capturing the important patterns and transformations needed for the adaptation.
The learned values in these matrices represent the low-rank adaptation that modifies the pre-trained weights to better suit the downstream task.
Why LoRA is Better!
LoRA addresses the limitations of existing solutions by introducing a more efficient and effective approach to model adaptation:
No Inference Latency: Unlike adapter layers, LoRA does not introduce additional depth to the model. The low-rank adaptation matrices can be merged with the pre-trained weights after fine-tuning, resulting in no extra inference latency compared to a fully fine-tuned model.
Compute and Memory Efficiency: LoRA uses a low-rank representation to encode the task-specific parameter increments, significantly reducing the number of trainable parameters. This makes fine-tuning more compute- and memory-efficient, especially for large models.
Optimisation Stability: Compared to prompt tuning, LoRA optimises the model parameters directly, which leads to more stable optimisation and monotonic performance improvements with increased model capacity.
Sequence Length Preservation: LoRA does not require reserving a portion of the sequence length for adaptation, allowing the full sequence length to be used for downstream tasks, potentially leading to better performance.
Flexibility and Composability: LoRA is agnostic to the training objective and can be easily integrated with existing models and architectures. It is also composable with other adaptation techniques, such as prefix tuning, offering further flexibility.
Enhanced Compatibility: Works well alongside other fine-tuning techniques like adapters and prefix tuning.
Conclusion
LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that takes advantage of the low intrinsic rank of the weight updates.
Instead of updating the entire weight matrix during fine-tuning, LoRA decomposes the weight update matrix into two smaller matrices (A and B) with a lower rank r.
This significantly reduces the number of trainable parameters and memory requirements while maintaining performance comparable to full fine-tuning.
Key Insights
Swappable LoRA Modules
One of the most significant advantages of LoRA is the ability to swap different fine-tuned LoRA modules for various tasks on a single base model.
This allows for a more flexible and efficient deployment of models, as you can easily switch between tasks without needing separate fully fine-tuned models for each task.
Inference Time Swapping
The swappable nature of LoRA modules can be used even at inference time.
This means that customers can choose which task they want the model to perform on-the-fly, without requiring multiple models to be running simultaneously. This is a powerful feature that sets LoRA apart from other adaptation methods.
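A simple way to see why swapping is cheap: merging a LoRA module is just an addition to the frozen weights, and unmerging is the corresponding subtraction. The sketch below operates on raw weight tensors and is illustrative rather than a production implementation; the random A and B matrices stand in for trained task-specific modules.

```python
import torch

def merge_lora(W0, A, B, alpha, r):
    """Return the merged weight W0 + (alpha / r) * B @ A."""
    return W0 + (alpha / r) * (B @ A)

def unmerge_lora(W, A, B, alpha, r):
    """Undo a previous merge, recovering the original frozen weight."""
    return W - (alpha / r) * (B @ A)

# Illustrative frozen weight and two task-specific LoRA modules (A, B pairs);
# the random values stand in for matrices learned during fine-tuning.
d, r, alpha = 512, 8, 16
W0 = torch.randn(d, d)
task_a = (torch.randn(r, d) * 0.01, torch.randn(d, r) * 0.01)
task_b = (torch.randn(r, d) * 0.01, torch.randn(d, r) * 0.01)

# Serve task A, then swap to task B without reloading the base model.
W = merge_lora(W0, *task_a, alpha, r)
W = merge_lora(unmerge_lora(W, *task_a, alpha, r), *task_b, alpha, r)
```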
Potential for Further Optimisation
While LoRA is often applied to the attention weights (specifically query and value matrices) in transformer models, the technique could potentially be applied to other weight matrices in the model.
Exploring the application of LoRA to different components of the model architecture could lead to further optimisations and improvements.
Balancing Rank and Performance
The rank of the low-rank matrices (A and B) in LoRA is a crucial hyperparameter that determines the trade-off between model performance and efficiency. While lower ranks lead to greater memory savings and faster training, it's essential to find the right balance for each specific task. Experimenting with different rank values and evaluating the results on a validation set can help determine the optimal configuration.
Implications for Model Accessibility
By significantly reducing the memory requirements and training costs, LoRA makes fine-tuning large language models more accessible to a wider range of researchers and practitioners.
This could accelerate the development and deployment of specialized models for various tasks and domains.
Handling large datasets
Fine-tuning with LoRA on large datasets may still require significant computational resources, even though the number of trainable parameters is reduced. Strategies such as data parallelism, model parallelism, or gradient accumulation can be employed to handle large datasets efficiently.
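For example, gradient accumulation lets you emulate a large effective batch size without holding it in memory all at once. The loop below is a generic PyTorch pattern; the tiny model and random data are stand-ins for a LoRA-wrapped model and a real dataloader.

```python
import torch
import torch.nn as nn

# Generic gradient-accumulation pattern: four micro-batches are accumulated
# before each optimiser step, emulating a 4x larger effective batch size.
model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 16)                  # stand-in micro-batch
    y = torch.randint(0, 2, (8,))
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()                         # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```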