
Low-Rank Adaptation (LoRA)

Fine-tuning large language models has become increasingly challenging due to the vast number of parameters involved and the computational resources required.

As models grow in size, traditional fine-tuning methods become impractical and costly.

Low-Rank Adaptation (LoRA) addresses this problem by efficiently adapting pre-trained language models to downstream tasks while significantly reducing the number of trainable parameters.

This document breaks down the famous October 2021 paper describing the technique.

The Intrinsic Rank Hypothesis that underpins LoRA

In a 2020 paper called "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" from the team at Facebook, it was found that pre-trained language models can still learn efficiently even when their parameters are randomly projected onto a much smaller subspace.

The intrinsic rank hypothesis is a crucial concept that underlies the development of LoRA.

The hypothesis suggests that the model's parameters lie in a lower-dimensional subspace than the full parameter space, referred to as the "intrinsic dimension" of the model.

The intrinsic dimension represents the minimum number of dimensions required to accurately capture the important information in the model's parameter space.

The intrinsic rank hypothesis extends this idea to weight updates that occur during fine-tuning of language models.

It posits that the updates to the weights also have a low "intrinsic rank", meaning they can be well-approximated by a low-rank matrix.

This hypothesis is significant because it implies that the necessary adjustments to the model during fine-tuning can be captured using a lower-dimensional representation, rather than updating the entire weight matrix.

By exploiting the intrinsic low rank of the weight updates, LoRA can significantly reduce the number of trainable parameters while still allowing for effective adaptation to downstream tasks.

This not only makes fine-tuning more computationally efficient but also enables the adaptation of large language models on resource-constrained devices.

The intrinsic rank hypothesis is a fundamental principle that guides the design and implementation of LoRA.

What are low rank matrices?

A matrix is a rectangular array of numbers arranged in rows and columns.

The rank of a matrix is the maximum number of linearly independent rows or columns in the matrix.

In other words, it's the dimension of the vector space spanned by the matrix's rows or columns.

Consider a matrix A:

A = [1 2 3]
    [2 4 6]
    [3 6 9]

In this matrix, we can see that the second row is 2 times the first row, and the third row is 3 times the first row.

This means that the rows are linearly dependent. We can express any row as a linear combination of the other rows.

Similarly, the second column is 2 times the first column, and the third column is 3 times the first column. The columns are also linearly dependent.

In this case, the rank of the matrix is 1. Despite the matrix being 3x3, it only contains one independent piece of information.
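To make this concrete, here is a minimal sketch in Python (assuming NumPy is installed) that checks the rank of the matrix above:

import numpy as np

# The 3x3 matrix from the example above: every row is a multiple of the first row.
A = np.array([[1, 2, 3],
              [2, 4, 6],
              [3, 6, 9]])

print(np.linalg.matrix_rank(A))  # prints 1: only one linearly independent row/column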

Now, let's look at the concept of lower-rank matrices.

A matrix is considered to be of lower rank if its rank is less than the minimum of its number of rows and columns. In the example above, the matrix has a rank of 1, which is lower than min(3, 3) = 3, so it is a lower-rank matrix.

The idea of lower-rank matrices is used in many applications, such as:

Data Compression: By approximating a matrix with a lower-rank matrix, we can store less data while preserving the most important information.

Recommendation Systems: User-item matrices in recommendation systems are often of lower rank because user preferences can be described by a smaller number of latent factors.

Image Processing: Many operations in image processing, such as image denoising and compression, exploit the fact that image matrices are often of lower rank.

The Rank-Nullity Theorem states that for a linear map (which can be represented by a matrix) between two vector spaces, the dimension of the domain (number of columns) equals the sum of the rank (dimension of the image) and the nullity (dimension of the kernel).

This theorem connects the concepts of rank and nullity, showing that they are complementary.

In summary, lower-rank matrices are matrices whose rank is less than the maximum possible given their dimensions.

They are used in many applications to simplify data, reduce dimensionality, and uncover hidden structures. The rank of a matrix is the number of linearly independent rows or columns, which are always equal (the row rank equals the column rank). The Rank-Nullity Theorem then relates this rank to the dimension of the kernel, as described above.
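As an illustrative check of this relationship (assuming NumPy and SciPy are available), the rank and nullity of the matrix from the earlier example sum to its number of columns:

import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, 3],
              [2, 4, 6],
              [3, 6, 9]])

rank = np.linalg.matrix_rank(A)     # dimension of the image
nullity = null_space(A).shape[1]    # dimension of the kernel
print(rank, nullity, A.shape[1])    # 1 2 3 -> rank + nullity = number of columns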

Weight Matrices in Transformers

In the Transformer architecture, the self-attention layer is a key component that allows the model to attend to different positions of the input sequence.

The self-attention mechanism is applied to the input embeddings or the output of the previous layer, which we'll denote as X, with shape (sequence length, dmodel).

The self-attention layer consists of multiple attention heads that operate in parallel.

Each attention head performs the following steps:

  1. Linearly project the input X into query, key, and value representations using the corresponding weight matrices (Wq, Wk, Wv).

  2. Compute the attention scores by taking the dot product of the query and key representations.

  3. Scale the attention scores and apply a softmax function to obtain the attention weights.

  4. Multiply the attention weights with the value representations to get the weighted values.

  5. Concatenate the weighted values from all attention heads and linearly project them using the output projection matrix (Wo).

In the Transformer architecture, there are four weight matrices in the self-attention module. Let's focus on their dimensions:

  • Query matrix (Wq): (dmodel, dq)

  • Key matrix (Wk): (dmodel, dk)

  • Value matrix (Wv): (dmodel, dv)

  • Output projection matrix (Wo): (dmodel, dmodel)

Weight matrices enable the Transformer model to learn and capture complex relationships and dependencies within the input sequence. The learned weights in these matrices are crucial for the model's ability to attend to relevant information and generate meaningful representations.

In summary, the self-attention mechanism in the Transformer architecture relies on four key weight matrices: query, key, value, and output projection matrices.

These matrices linearly project the input into different subspaces, enabling the model to capture dependencies and learn meaningful representations. The dimensions of these matrices are typically determined by the model's hyperparameters, such as the dimensionality of the model and the number of attention heads.
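As a rough illustration of these steps and shapes (not code from the paper; the sizes seq_len, d_model and d_k are arbitrary example values), a single attention head can be sketched in PyTorch as follows:

import torch

seq_len, d_model, d_k = 10, 512, 64        # illustrative sizes for one attention head
X = torch.randn(seq_len, d_model)          # input X: (sequence length, dmodel)

W_q = torch.randn(d_model, d_k)            # query projection matrix Wq
W_k = torch.randn(d_model, d_k)            # key projection matrix Wk
W_v = torch.randn(d_model, d_k)            # value projection matrix Wv

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # step 1: project input into query/key/value
scores = (Q @ K.T) / d_k ** 0.5            # step 2: dot-product scores, scaled
weights = torch.softmax(scores, dim=-1)    # step 3: attention weights
head_output = weights @ V                  # step 4: weighted values, shape (seq_len, d_k)

# Step 5 (not shown): the outputs of all heads are concatenated and projected with Wo.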

Understanding the roles and dimensions of these weight matrices is crucial for grasping the concepts behind LoRA and the Intrinsic Rank Hypothesis, which we will explore in the next section.

How does LoRA work?

LoRA (Low-Rank Adaptation) introduces a modification to the weight matrices to efficiently adapt the pre-trained model to downstream tasks.

Let's see how LoRA gets involved in the process and influences the weights.

Recall the self-attention mechanism has four weight matrices:

  • Query matrix (Wq)

  • Key matrix (Wk)

  • Value matrix (Wv)

  • Output projection matrix (Wo)

These matrices are typically learned during the pre-training phase and have full rank.

In linear algebra, a matrix is said to have full rank if its rank is equal to the smaller of its number of rows or columns.

In the context of neural networks, this means that the weight matrices in dense layers are typically not low-rank - they cannot be exactly represented as a product of two smaller matrices.

LoRA modifies the weight matrices by introducing a low-rank decomposition of the weight updates.

Instead of directly updating the pre-trained weight matrices, LoRA represents the weight updates using two smaller matrices, A and B, such that:

Wupdated = Wpretrained + ∆W

∆W = B * A

where:

  • Wpretrained is the original pre-trained weight matrix (Wq, Wk, Wv, or Wo)

  • ∆W is the weight update matrix

  • B is a matrix of size (dmodel, r), where r is the rank of the decomposition

  • A is a matrix of size (r, dmodel)

The key idea behind LoRA is to use a low-rank decomposition of the weight updates. By representing the weight updates as the product of two smaller matrices, A and B, LoRA allows for a more compact and efficient representation.

The rank (r) of the decomposition determines the size of these matrices and controls the expressiveness of the adaptation.

A lower rank results in fewer trainable parameters, making the adaptation more memory-efficient, while a higher rank allows for more flexibility in adapting the weights.
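To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear map (not the authors' implementation; the class name and the 0.01 scale on A are illustrative choices):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_model: int, r: int):
        super().__init__()
        # Frozen pre-trained weight, standing in for Wq, Wk, Wv or Wo
        self.W = nn.Parameter(torch.randn(d_model, d_model), requires_grad=False)
        # Trainable low-rank factors: B is (d_model, r), A is (r, d_model)
        self.B = nn.Parameter(torch.zeros(d_model, r))
        self.A = nn.Parameter(torch.randn(r, d_model) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_W = self.B @ self.A        # (d_model, d_model), rank at most r
        return x @ (self.W + delta_W)    # Wupdated = Wpretrained + ∆W

layer = LoRALinear(d_model=512, r=8)
out = layer(torch.randn(10, 512))        # only A and B will receive gradients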

During the fine-tuning process with LoRA

The method involves freezing the original model weights and adjusting only two smaller matrices, AA and BB.

  1. The pre-trained weight matrices (Wpretrained) remain frozen and do not receive gradient updates.

  2. The matrices A and B are initialised randomly and are the only trainable parameters. Matrix A is initialised with a random Gaussian distribution, while matrix B is initialised with zeros.

Gaussian initialisation

In the LoRA method, matrix A is initialised with a random Gaussian distribution, while matrix B is initialised with zeros.

The reason for using Gaussian initialisation for matrix A is to introduce randomness and break symmetry in the initial values of the weights.

When the weights of a neural network are initialised to the same value (e.g., all zeros), the network may struggle to learn meaningful patterns because all the neurons behave identically.

By initialising the weights with random values drawn from a Gaussian distribution, we ensure that the neurons start with different initial activations, allowing them to learn diverse features during training.

The choice of Gaussian initialisation is based on the principle of "symmetry breaking" and the idea that the weights should be initialised with small random values to facilitate gradient flow and prevent vanishing or exploding gradients. Gaussian initialisation has been shown to work well in practice and is commonly used in deep learning.

In the context of LoRA, initialising matrix A with a Gaussian distribution gives the A factor of the weight update matrix ∆W (the product of B and A) small random starting values.

This allows the model to gradually adapt the pre-trained weights to the downstream task during fine-tuning.

By initialising matrix B with zeros, the initial weight update matrix ∆W is effectively zero, meaning that the model starts with the original pre-trained weights.

As training progresses, the values of A and B are updated based on the gradients, allowing the model to learn the necessary adaptations for the specific task.

The combination of Gaussian initialisation for matrix A and zero initialisation for matrix B in LoRA ensures a balanced starting point for fine-tuning, facilitating the learning of task-specific adaptations while leveraging the knowledge captured in the pre-trained weights.
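A small illustration of this initialisation (shapes are illustrative): because B starts at zero, the product BA, and hence ∆W, is exactly zero at the start of fine-tuning:

import torch

r, d_model = 4, 512
A = torch.randn(r, d_model) * 0.01      # Gaussian initialisation for A
B = torch.zeros(d_model, r)             # zero initialisation for B

delta_W = B @ A                         # initial weight update ∆W
print(torch.count_nonzero(delta_W))     # tensor(0): training starts from the pre-trained weights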

Returning to the fine-tuning process:

  3. In the forward pass, the input is multiplied by both the pre-trained weight matrix (Wpretrained) and the LoRA weight update matrix (∆W = B * A). The results are then summed element-wise to obtain the updated output.

  4. During backpropagation, gradients are computed only for the A and B matrices, while the pre-trained weight matrices remain unchanged.

  5. The optimisation process updates the A and B matrices based on the gradients, allowing the model to adapt to the downstream task (a code sketch of these steps follows below).
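These steps can be sketched as follows (a toy example with a placeholder loss, not the paper's training code):

import torch

d_model, r = 512, 8
W_pretrained = torch.randn(d_model, d_model)            # frozen: no gradients requested
A = (0.01 * torch.randn(r, d_model)).requires_grad_()   # trainable, small Gaussian init
B = torch.zeros(d_model, r, requires_grad=True)         # trainable, zero init

optimiser = torch.optim.AdamW([A, B], lr=1e-3)          # only A and B are optimised

x = torch.randn(10, d_model)
output = x @ W_pretrained + x @ (B @ A)   # step 3: frozen path plus LoRA path
loss = output.pow(2).mean()               # placeholder loss, purely for illustration
loss.backward()                           # step 4: gradients reach only A and B
optimiser.step()                          # step 5: update A and B

print(W_pretrained.requires_grad, A.requires_grad, B.requires_grad)  # False True True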

Forward Pass and Backpropagation explanation

Forward Pass

During the forward pass with LoRA, the input is multiplied by both the pre-trained weight matrix Wpretrained and the weight update matrix ∆W. The updated output is computed as follows:

output = input × Wpretrained + input × ∆W, where ∆W = B × A

The pre-trained weight matrix Wpretrained remains frozen, while the matrices A and B are learned during fine-tuning.

The output of the self-attention layer is obtained by summing the results of the matrix multiplications.

Backpropagation

During backpropagation, the gradients are computed with respect to the input and the trainable parameters.

In LoRA, only the matrices A and B are updated based on the gradients, while the pre-trained weight matrix Wpretrained remains unchanged.

The gradients of the loss with respect to A and B are computed using the chain rule:

∂loss / ∂A = B^T × (∂loss / ∂∆W)
∂loss / ∂B = (∂loss / ∂∆W) × A^T

The optimiser then uses these gradients to update the values of A and B.
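These expressions can be checked numerically with autograd (a small sketch using arbitrary shapes):

import torch

d_model, r = 6, 2
A = torch.randn(r, d_model, requires_grad=True)
B = torch.randn(d_model, r, requires_grad=True)

delta_W = B @ A
loss = delta_W.sum()                  # simple loss so that ∂loss/∂∆W is a matrix of ones
loss.backward()

grad_delta_W = torch.ones(d_model, d_model)
print(torch.allclose(A.grad, B.T @ grad_delta_W))   # ∂loss/∂A = B^T × ∂loss/∂∆W -> True
print(torch.allclose(B.grad, grad_delta_W @ A.T))   # ∂loss/∂B = ∂loss/∂∆W × A^T -> True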

By representing the weight updates using a low-rank decomposition (∆W = B * A), LoRA significantly reduces the number of trainable parameters.

The rank r determines the size of the matrices A and B and controls the expressiveness of the adaptation.

A smaller rank r results in fewer trainable parameters and more efficient adaptation, while a larger rank r allows for more flexibility in adapting the weights.

What are the low rank matrices?

In the Low-Rank Adaptation (LoRA) method proposed in this paper, the terms A and B refer to the low-rank matrices used to approximate the weight update matrix ∆W during adaptation.

As discussed, the weight matrix W being targeted is part of the Transformer architecture, specifically the weight matrices in the self-attention module.

Matrices A and B

As highlighted, the authors of LoRA hypothesised that during fine-tuning, the updates to the weights (∆W) have a low "intrinsic rank", meaning they can be well-approximated by a low-rank matrix.

This means that significant changes to the neural network can be captured using a lower-dimensional representation.

Essentially, the idea is that not all elements of ∆W are equally important; instead, a smaller subset of these changes can effectively encapsulate the necessary adjustments.

Building on this hypothesis, LoRA proposes representing ∆W as the product of two smaller matrices, A and B, with a lower rank.

∆W denotes the change applied to the initial weights as a result of training.

The updated weight matrix W' thus becomes:

W' = W + BA

In this equation, W remains frozen (i.e., it is not updated during training).

The matrices B and A are of lower dimensionality, with their product BA representing a low-rank approximation of ∆W.

Impact of Lower Rank on Trainable Parameters

By choosing matrices A and B to have a lower rank r, the number of trainable parameters is significantly reduced.

For example, if W is a d × d matrix, updating W directly would involve d² parameters.

However, with B and A of sizes d × r and r × d respectively, the total number of parameters reduces to 2dr, which is much smaller when r << d.
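A quick back-of-the-envelope calculation in Python (d and r are illustrative values):

d, r = 4096, 8
full_finetune_params = d * d                 # updating a d x d matrix directly
lora_params = 2 * d * r                      # B is d x r, A is r x d
print(full_finetune_params)                  # 16777216
print(lora_params)                           # 65536
print(full_finetune_params // lora_params)   # 256 -> 256x fewer trainable parameters for this matrix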

The reduction in the number of trainable parameters, as achieved through the Low-Rank Adaptation (LoRA) method, offers several significant benefits, particularly when fine-tuning large-scale neural networks.

What does 'rank' mean?

The term "rank" in the context of LoRA refers to the rank of the weight update matrix W∆W, which is approximated by the product of two smaller matrices AA and BB.

The idea here is to exploit the fact that matrices can contain “duplicate information” in the form of linear dependence.

This means we can use factorisation to represent a large matrix in terms of two smaller matrices.

Just as a large number can be represented as the product of two smaller numbers, a matrix can be thought of as the product of two smaller matrices.

The rank of a matrix is the maximum number of linearly independent rows or columns in the matrix.

In other words, it is the dimension of the vector space spanned by the rows or columns of the matrix.

Example: Consider the following matrix M:

[1 2 3]
[4 5 6]
[7 8 9]

The rows of this matrix are not linearly independent because the third row is a linear combination of the first two rows (row3 = 2 × row2 − row1).

Similarly, the columns are not linearly independent because the third column is a linear combination of the first two columns (column3 = 2 × column2 − column1).

To find the rank of the matrix, we can use Gaussian elimination to convert the matrix into row echelon form.

Row echelon form is a type of matrix in which:

  1. All nonzero rows are above any rows of all zeros.

  2. The leading entry of each nonzero row after the first occurs to the right of the leading entry of the previous row.

  3. The leading entry in any nonzero row is 1.

  4. All entries in the column below a leading 1 are zeros.

Using Gaussian elimination, a matrix is transformed into row echelon form. This helps determine properties such as the rank and solve linear systems.

Gaussian elimination is a method used to solve systems of linear equations, find the rank r, and calculate the determinant of a matrix.

The process involves three main steps:

  1. Forward Elimination: Transform the matrix into an upper triangular form.

  2. Pivoting: Swap rows to position the highest absolute value as the pivot element to avoid division by zero and improve numerical stability.

  3. Back Substitution: Solve for the variables starting from the last row upwards.

By converting the matrix into row echelon form, Gaussian elimination simplifies the system, making it easier to understand its properties and solutions.

After performing Gaussian elimination, we get the row echelon form of M:

[1 2 3]
[0 1 2]
[0 0 0]

The number of non-zero rows in the row echelon form is the rank of the matrix.

In this case, the rank is 2, which is the maximum number of linearly independent rows or columns in the matrix M.
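The same result can be verified programmatically (a sketch assuming SymPy is available; rref() returns the reduced row echelon form, a further-normalised variant of the form shown above):

from sympy import Matrix

M = Matrix([[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]])

rref_form, pivot_columns = M.rref()   # reduced row echelon form and pivot column indices
print(rref_form)                      # two nonzero rows
print(M.rank())                       # 2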

In the context of LoRA, the rank r determines the dimensionality of the subspace in which the weight update matrix ∆W is approximated.

By choosing a lower rank r, we enforce a low-dimensional structure on the weight update matrix, reducing the number of trainable parameters while still capturing the most important aspects of the adaptation.

So to sum up, in LoRA, the rank r is the hyperparameter that determines the size of the matrices A and B.

A conceptual diagram of LoRA with an r value equal to 1 and 2. In both examples the decomposed A and B matrices result in the same sized change matrix, but r=2 is able to encode more linearly independent information into the change matrix, due to having more information in the A and B matrices. Source: Daniel Warfield

Specifically:

  • Matrix A has shape (r, dmodel), where r is the rank of the decomposition and dmodel is the dimension of the model.

  • Matrix B has shape (dmodel, r).

So, the updated equation is:

Wupdated = Wpretrained + ∆W

∆W = B * A

where:

  • Wpretrained is the original pre-trained weight matrix, of shape (dmodel, dmodel)

  • ∆W is the weight update matrix, of shape (dmodel, dmodel)

  • B is a matrix of size (dmodel, r)

  • A is a matrix of size (r, dmodel)

The product of B and A results in a matrix ∆W of shape (dmodel, dmodel), which has rank at most r.

By choosing a smaller value for r, we enforce a low-rank structure on the weight update matrix ∆W.

Expressiveness and Rank

The rank r of the weight update matrix ∆W controls the expressiveness of the adaptation.

A higher rank allows for more flexibility in adapting the weights, as it can capture more complex patterns and transformations. However, increasing the rank also increases the number of trainable parameters.

On the other hand, a lower rank restricts the expressiveness of the adaptation but results in fewer trainable parameters.

This is because the matrices A and B have fewer elements when r is smaller.

A low-rank approximation of the weight update matrix can still capture the most important aspects of the adaptation while being more parameter-efficient.

The choice of rank r depends on the complexity of the downstream task and the available computational resources. It is a trade-off between expressiveness and efficiency.

The consensus is that when the data is similar to the data used in pre-training, a low rank r value is probably sufficient. When fine tuning on very new tasks, which might require substantial logical changes within the model, a high rank r value may work better.

Applying LoRA to Transformer Self Attention Weights

  • LoRA can be applied to any or all of the self-attention weight matrices (Wq, Wk, Wv, Wo) in each Transformer layer.

  • The paper primarily focuses on adapting only the query (Wq) and value (Wv) matrices because they play the most critical role in capturing and transforming the input representations.

  • During the forward pass, the adapted weight matrix W' is computed as W' = W + BA, where W is the pre-trained weight matrix and BA is the low-rank adaptation term.

  • This allows for efficient adaptation without introducing additional inference latency, as the low-rank matrices can be merged with the pre-trained weights during deployment (a sketch of this merging step follows below).
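A small sketch of this merging step (illustrative shapes, with random values standing in for learned LoRA factors):

import torch

d_model, r = 512, 8
W = torch.randn(d_model, d_model)        # pre-trained weight, frozen during fine-tuning
B = torch.randn(d_model, r) * 0.01       # stands in for the learned LoRA factor B
A = torch.randn(r, d_model) * 0.01       # stands in for the learned LoRA factor A

W_merged = W + B @ A                     # merge once before deployment: W' = W + BA

x = torch.randn(10, d_model)
# One matmul against the merged weight reproduces the two-path computation used during
# training, so inference adds no extra latency.
print(torch.allclose(x @ W_merged, x @ W + x @ (B @ A), atol=1e-4))   # True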

The key idea behind LoRA is to exploit the low-rank structure of the adaptation matrix ∆W.

By approximating ∆W with low-rank matrices A and B, LoRA significantly reduces the number of trainable parameters while still allowing for effective adaptation to downstream tasks.

This makes it suitable for adapting large pre-trained language models, where standard fine-tuning would be prohibitively expensive in terms of storage and computation.

The problem with full fine tuning

To remind us of the problem LoRA is solving, below is a brief description of the language modeling problem for full fine-tuning: in particular, the maximisation of conditional probabilities given a task-specific prompt.

  • PΦ(y|x) represents a pre-trained autoregressive language model, where Φ denotes the model parameters.

  • Z = {(xi, yi)}, i = 1, ..., N is a dataset of context-target pairs for a downstream task, where xi is the context and yi is the target sequence.

  • Φ0 represents the initial pre-trained weights of the model.

  • ∆Φ represents the task-specific parameter increment during fine-tuning.

  • Θ is a smaller set of parameters used to encode ∆Φ.

Language Modeling Objective

The goal is to adapt a pre-trained language model to downstream conditional text generation tasks.

The model learns to maximise the conditional probability of the target sequence given the context.

During full fine-tuning, the objective is to maximise the sum of log probabilities of each token yt in the target sequence y, conditioned on the context x and the previous tokens y<t.

This is achieved by updating the model parameters from Φ0 to Φ0 + ∆Φ through gradient-based optimisation of the objective:

maxΦ Σ(x,y)∈Z Σt=1..|y| log PΦ(yt | x, y<t)

The notation (x, y) ∈ Z indicates that the summation is performed over all context-target pairs in the dataset Z. |y| denotes the length of the target sequence y.

Parameter-Efficient Approach

The main drawback of full fine-tuning is that for each downstream task, a separate set of parameters ∆Φ is learned, which has the same dimension as the pre-trained weights Φ0.

This can be challenging to store and deploy, especially for large models.

To address this, the authors propose a parameter-efficient approach where the task-specific parameter increment ∆Φ is encoded by a much smaller set of parameters Θ, such that |Θ| << |Φ0|.

The objective becomes:

maxΘ Σ(x,y)∈Z Σt=1..|y| log pΦ0+∆Φ(Θ)(yt | x, y<t)

Here, ∆Φ is a function of Θ, denoted as ∆Φ(Θ).

The goal is to find the optimal Θ that maximises the conditional language modeling objective.

Low-Rank Representation

The authors propose to use a low-rank representation to encode ∆Φ, which is both compute- and memory-efficient.

This means that ∆Φ is represented as a product of smaller matrices, reducing the number of parameters needed to store and update during fine-tuning.

The key idea is to significantly reduce the number of trainable parameters |Θ| compared to the size of the pre-trained weights |Φ0|.

For example, when using the GPT-3 175-billion-parameter model as the pre-trained model, the number of trainable parameters |Θ| can be as small as 0.01% of |Φ0|, greatly reducing the storage and computational requirements for fine-tuning on downstream tasks.

Issues with existing solutions

Other Parameter Efficient Fine Tuning (PEFT) approaches for model adaptation in transfer learning, such as adding adapter layers or optimising input-layer activations, have limitations, especially in large-scale and latency-sensitive production scenarios.

We discuss these below:

Adapter Layers

Adapter layers, as proposed by Houlsby et al. (2019) and Lin et al. (2020), are additional layers inserted into the Transformer architecture to enable parameter-efficient fine-tuning.

While adapters have fewer parameters compared to the original model, they introduce extra computation that must be processed sequentially, leading to increased inference latency.

The latency issue becomes more prominent in online inference settings where the batch size is small (for example a batch size of 1).

Even with a small bottleneck dimension in the adapter layers, the added latency can be significant (e.g., 20-30% slower for GPT-2 medium on a single GPU). This problem is amplified when using model parallelism techniques, as the additional depth requires more synchronous GPU operations.

Optimising Input Layer Activations (Prompt Tuning)

Another PEFT approach is prefix tuning (Li & Liang, 2021), which directly optimises a portion of the input layer activations (the prompt) while keeping the pre-trained model unchanged.

However, this method faces optimisation challenges and exhibits non-monotonic performance changes with respect to the number of trainable parameters.

Non-Monotonic Performance Changes in Prompt Tuning

Non-monotonic performance changes refer to fluctuations in model performance that do not consistently improve or degrade as the number of trainable parameters increases.

In the context of prompt tuning, this means that increasing the number of trainable parameters does not guarantee a corresponding increase in model performance.

Instead, performance may improve up to a point, then degrade, and potentially improve again, creating a non-linear relationship. These irregular performance patterns can complicate the optimisation process and make it challenging to determine the optimal parameter configuration for prompt tuning.

LoRA in Practice

What subset of weight matrices should be adapted for maximum downstream performance?

The authors experimented with applying LoRA to different subsets of the self-attention weight matrices when adapting GPT-3, with a budget of 18 million trainable parameters (compared to GPT-3's 175 billion parameters). The weight matrices are as follows:

  • Wq (query)

  • Wk (key)

  • Wv (value)

  • Wo (output)

They found that adapting both the query weights Wq and value weights Wv yielded the best performance on the downstream tasks evaluated (WikiSQL and MultiNLI).

Adapting Wq, Wk, Wv, and Wo together also performed well, but adapting only Wq or Wk resulted in significantly lower performance.

This suggests that even with a low rank (e.g., r=4), adapting multiple weight matrices captures more useful information than adapting a single weight matrix with a higher rank.

What is the optimal rank for the adaptation matrix ∆W?

The authors investigated the effect of the LoRA rank r on downstream performance.

Surprisingly, they found that a rank as low as r=1 was sufficient for adapting both Wq and Wv on the datasets they tested, while adapting Wq alone required a larger rank r.

They further analysed the subspace similarity between the learned adaptation matrices with different ranks (r=8 and r=64) and found that the top singular vector directions overlapped significantly, while the other directions did not.

This suggests that the additional directions learned with higher ranks might contain mostly random noise.

The authors conclude that the optimal adaptation matrix ∆W can indeed have a very low intrinsic rank, although they note that this may not hold for every task or dataset.

Connection between ∆W and W

To investigate the relationship between the adaptation matrix ∆W and the pre-trained weight matrix W, the authors projected W onto the r-dimensional subspace of ∆W and compared the resulting Frobenius norms.

They found that ∆W has a stronger correlation with W compared to a random matrix, indicating that ∆W amplifies some features that are already present in W.

However, instead of amplifying the top singular directions of W, ∆W emphasises directions that are not as prominent in W.

The amplification factor is quite large (e.g., 21.5 for r=4 in the 48th layer of GPT-3).

This suggests that the low-rank adaptation matrix amplifies important features for specific downstream tasks that were learned but not emphasised in the general pre-training model.

Process for determining the optimal rank r for LoRA when fine-tuning

  1. Start with a low rank r (e.g., r=1 or r=2) and fine-tune the model on the downstream task.

  2. Gradually increase the rank r (e.g., r=4, r=8) and compare the performance on a validation set.

  3. If increasing the rank leads to significant improvements, continue increasing rank r until the performance gains plateau or the computational cost becomes too high.

  4. If the performance is already good with a low rank, try adapting additional weight matrices (e.g., Wq and Wv together) with the same low rank.

  5. Compare the performance and computational cost of different combinations of rank and adapted weight matrices to find the optimal configuration for the specific downstream task and resource constraints.

Keep in mind that the optimal rank may vary depending on the complexity of the downstream task, the size of the dataset, and the similarity between the pre-training and downstream domains.

It's also important to consider the trade-off between performance and computational efficiency when choosing the rank.

Contents of A and B

The matrices A and B contain learned parameters that are updated during the fine-tuning process. They are initialised randomly at the beginning of training:

  • Matrix A is initialised with a random Gaussian distribution.

  • Matrix B is initialised with zeros.

During training, the values in A and B are updated based on the gradients computed during backpropagation.

These matrices learn to adapt the pre-trained weights to the specific downstream task by capturing the important patterns and transformations needed for the adaptation.

The content of A and B is learned through the optimisation process and depends on the specific task and dataset being fine-tuned on.

The learned values in these matrices represent the low-rank adaptation that modifies the pre-trained weights to better suit the downstream task.

Backpropagation is a fundamental algorithm used in training neural networks and other differentiable machine learning models. It is a method for efficiently calculating the gradients of the model's parameters with respect to the loss function. The goal of backpropagation is to update the model's parameters in a way that minimises the difference between the predicted output and the desired output.

Why LoRA is Better!

LoRA addresses the limitations of existing solutions by introducing a more efficient and effective approach to model adaptation:

No Inference Latency: Unlike adapter layers, LoRA does not introduce additional depth to the model. The low-rank adaptation matrices can be merged with the pre-trained weights after fine-tuning, resulting in no extra inference latency compared to a fully fine-tuned model.

Compute and Memory Efficiency: LoRA uses a low-rank representation to encode the task-specific parameter increments, significantly reducing the number of trainable parameters. This makes fine-tuning more compute- and memory-efficient, especially for large models.

Optimisation Stability: Compared to prompt tuning, LoRA optimises the model parameters directly, which leads to more stable optimisation and monotonic performance improvements with increased model capacity.

Sequence Length Preservation: LoRA does not require reserving a portion of the sequence length for adaptation, allowing the full sequence length to be used for downstream tasks, potentially leading to better performance.

Flexibility and Composability: LoRA is agnostic to the training objective and can be easily integrated with existing models and architectures. It is also composable with other adaptation techniques, such as prefix tuning, offering further flexibility.

Enhanced Compatibility: Works well alongside other fine-tuning techniques like adapters and prefix tuning.

Conclusion

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that takes advantage of the low intrinsic rank of the weight updates.

Instead of updating the entire weight matrix during fine-tuning, LoRA decomposes the weight update matrix into two smaller matrices (A and B) with a lower rank r.

This significantly reduces the number of trainable parameters and memory requirements while maintaining performance comparable to full fine-tuning.

Key Insights

Swappable LoRA Modules

One of the most significant advantages of LoRA is the ability to swap different fine-tuned LoRA modules for various tasks on a single base model.

This allows for a more flexible and efficient deployment of models, as you can easily switch between tasks without needing separate fully fine-tuned models for each task.

Inference Time Swapping

The swappable nature of LoRA modules can be used even at inference time.

This means that customers can choose which task they want the model to perform on-the-fly, without requiring multiple models to be running simultaneously. This is a powerful feature that sets LoRA apart from other adaptation methods.

Potential for Further Optimisation

While LoRA is often applied to the attention weights (specifically query and value matrices) in transformer models, the technique could potentially be applied to other weight matrices in the model.

Exploring the application of LoRA to different components of the model architecture could lead to further optimisations and improvements.

Balancing Rank and Performance

The rank of the low-rank matrices (A and B) in LoRA is a crucial hyperparameter that determines the trade-off between model performance and efficiency. While lower ranks lead to greater memory savings and faster training, it's essential to find the right balance for each specific task. Experimenting with different rank values and evaluating the results on a validation set can help determine the optimal configuration.

Implications for Model Accessibility

By significantly reducing the memory requirements and training costs, LoRA makes fine-tuning large language models more accessible to a wider range of researchers and practitioners.

This could accelerate the development and deployment of specialized models for various tasks and domains.

Handling large datasets

Fine-tuning with LoRA on large datasets may still require significant computational resources, even though the number of trainable parameters is reduced. Strategies such as data parallelism, model parallelism, or gradient accumulation can be employed to handle large datasets efficiently.
