# Low-Rank Adaptation (LoRA)

Fine-tuning large language models has become increasingly challenging due to the vast number of parameters involved and the computational resources required.&#x20;

As models grow in size, traditional fine-tuning methods become impractical and costly.&#x20;

<mark style="color:blue;">**Low-Rank Adaptation (LoRA)**</mark> addresses this problem by efficiently adapting pre-trained language models to downstream tasks while significantly reducing the number of trainable parameters.&#x20;

This document breaks down the famous <mark style="color:blue;">**October 2021**</mark> paper describing the technique.

{% embed url="https://arxiv.org/abs/2106.09685" %}
LoRA: Low-Rank Adaptation of Large Language Models - cited over 2,000 times
{% endembed %}

### <mark style="color:purple;">The Intrinsic Rank Hypothesis that underpins LoRA</mark> <a href="#a64b" id="a64b"></a>

In a 2020 paper called <mark style="color:blue;">**"Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning"**</mark>, a team at Facebook found that pre-trained language models can still learn efficiently *<mark style="color:yellow;">**even when their parameters are randomly projected onto a**</mark>*[ *<mark style="color:yellow;">**smaller subspace**</mark>*](#user-content-fn-1)[^1].&#x20;

The intrinsic rank hypothesis is a crucial concept that underlies the development of LoRA. &#x20;

The hypothesis suggests that the model's parameters lie in a lower-dimensional subspace than the full parameter space, referred to as the <mark style="color:blue;">**"intrinsic dimension"**</mark> of the model.

The intrinsic dimension represents the minimum number of dimensions required to accurately capture the important information in the model's parameter space.

The intrinsic rank hypothesis *<mark style="color:yellow;">**extends this idea to weight updates that occur during fine-tuning of language models**</mark>*.&#x20;

It posits that the updates to the weights also have a low <mark style="color:blue;">**"intrinsic rank"**</mark>**,** meaning they can be well-approximated by a <mark style="color:blue;">**low-rank matrix**</mark>.  &#x20;

This hypothesis is significant because it implies that the necessary adjustments to the model during fine-tuning can be captured using a lower-dimensional representation, *<mark style="color:yellow;">**rather than updating the entire weight matrix**</mark>*.

By exploiting the <mark style="color:blue;">**intrinsic low rank**</mark> of the <mark style="color:blue;">**weight updates**</mark>, LoRA can significantly reduce the number of trainable parameters while still allowing for effective adaptation to downstream model tasks.&#x20;

This not only makes fine-tuning more computationally efficient but also enables the adaptation of large language models on resource-constrained devices.&#x20;

The <mark style="color:blue;">**intrinsic rank hypothesis**</mark> is a fundamental principle that guides the design and implementation of LoRA.

<details>

<summary><mark style="color:green;"><strong>What are low rank matrices?</strong></mark></summary>

A <mark style="color:blue;">**matrix**</mark> is a rectangular array of numbers arranged in <mark style="color:blue;">**rows and columns**</mark>.&#x20;

The <mark style="color:blue;">**rank of a matrix**</mark> is the *<mark style="color:yellow;">**maximum number of linearly independent rows or columns**</mark>* in the matrix.&#x20;

In other words, it's the dimension of the vector space spanned by the matrix's rows or columns.

Consider a <mark style="color:blue;">**matrix**</mark> A:

```
A = [1 2 3]
    [2 4 6]
    [3 6 9]
```

In this matrix, we can see that the second row is 2 times the first row, and the third row is 3 times the first row.&#x20;

This means that the rows are <mark style="color:blue;">**linearly dependent**</mark>. We can express any row as a linear combination of the other rows.

Similarly, the second column is 2 times the first column, and the third column is 3 times the first column. The columns are also <mark style="color:blue;">**linearly dependent**</mark>.

In this case, the rank of the matrix is 1. Despite the matrix being 3x3, it only contains one independent piece of information.

Now, let's look at the *<mark style="color:yellow;">**concept of lower-rank matrices**</mark>.*&#x20;

A matrix is considered to be of lower rank if its <mark style="color:yellow;">**rank is less than the minimum of its number of rows and columns**</mark>. In the example above, the matrix has a rank of 1, which is lower than min(3, 3) = 3, so it is a lower-rank matrix.
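
The rank of a matrix like this can be checked directly; here is a small NumPy sketch (illustrative only):

```python
import numpy as np

# the rank-1 example matrix from above
A = np.array([[1, 2, 3],
              [2, 4, 6],
              [3, 6, 9]])

# despite being 3x3, only one row/column is independent
print(np.linalg.matrix_rank(A))  # 1
```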

The idea of lower-rank matrices is used in many applications, such as:

<mark style="color:blue;">**Data Compression**</mark>: By approximating a matrix with a lower-rank matrix, we can store less data while preserving the most important information.

<mark style="color:blue;">**Recommendation Systems:**</mark> User-item matrices in recommendation systems are often of lower rank because user preferences can be described by a smaller number of latent factors.

<mark style="color:blue;">**Image Processing:**</mark> Many operations in image processing, such as image denoising and compression, exploit the fact that image matrices are often of lower rank.

The <mark style="color:blue;">**Rank-Nullity Theorem**</mark> states that for a linear map (which can be represented by a matrix) between two vector spaces, the dimension of the domain (number of columns) equals the sum of the rank (dimension of the image) and the nullity (dimension of the kernel).&#x20;

This theorem connects the concepts of rank and nullity, showing that they are complementary.
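
The theorem can be verified numerically; in this sketch the nullity is estimated by counting near-zero singular values (NumPy, illustrative tolerance):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [3., 6., 9.]])

rank = np.linalg.matrix_rank(A)          # dimension of the image
s = np.linalg.svd(A, compute_uv=False)   # singular values
nullity = int(np.sum(s < 1e-10))         # dimension of the kernel

# rank + nullity equals the number of columns (Rank-Nullity Theorem)
assert rank + nullity == A.shape[1]
```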

In summary, lower-rank matrices are matrices whose rank is less than the maximum possible given their dimensions.&#x20;

They are used in many applications to simplify data, reduce dimensionality, and uncover hidden structures. The rank of a matrix is the number of linearly independent rows, which is always equal to the number of linearly independent columns (row rank equals column rank).

</details>

### <mark style="color:purple;">**Weight Matrices in Transformers**</mark>

In the Transformer architecture, the <mark style="color:blue;">**self-attention layer**</mark> is a key component that allows the model to attend to different positions of the <mark style="color:blue;">**input sequence**</mark>.&#x20;

The <mark style="color:blue;">**self-attention mechanism**</mark> is applied to the <mark style="color:blue;">**input embeddings**</mark> or the output of the previous layer, which we'll denote as $$X$$, with <mark style="color:blue;">**shape**</mark> $$(sequence\_length, dmodel)$$.

The self-attention layer consists of <mark style="color:blue;">**multiple attention heads that operate in parallel**</mark>.&#x20;

Each <mark style="color:blue;">**attention head**</mark> performs the following steps:

1. Linearly project the <mark style="color:blue;">**input**</mark> $$X$$ into <mark style="color:blue;">**query**</mark>, <mark style="color:blue;">**key**</mark>, and <mark style="color:blue;">**value**</mark> representations using the corresponding <mark style="color:blue;">**weight matrices**</mark> ( $$Wq$$, $$Wk$$, $$Wv$$).
2. Compute the <mark style="color:blue;">**attention scores**</mark> by taking the <mark style="color:blue;">**dot product**</mark> of the <mark style="color:blue;">**query**</mark> and <mark style="color:blue;">**key**</mark> representations.
3. Scale the attention scores and apply a <mark style="color:blue;">**softmax function**</mark> to obtain the <mark style="color:blue;">**attention weights**</mark>.
4. <mark style="color:blue;">**Multiply the attention weights**</mark> with the <mark style="color:blue;">**value representations**</mark> to get the <mark style="color:blue;">**weighted values**</mark>.
5. <mark style="color:blue;">**Concatenate the weighted values**</mark> from all attention heads and linearly project them using the <mark style="color:blue;">**output projection matrix**</mark> $$(Wo)$$.
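
To make the five steps concrete, here is a minimal single-head sketch in NumPy (shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One self-attention head: X has shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # step 1: project to Q, K, V
    scores = Q @ K.T                                 # step 2: dot-product scores
    scores = scores / np.sqrt(K.shape[-1])           # step 3: scale...
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # ...and softmax
    return weights @ V                               # step 4: weight the values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = attention_head(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8); step 5 would concatenate heads and apply Wo
```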

In the Transformer architecture, there are <mark style="color:yellow;">**four**</mark> <mark style="color:blue;">**weight matrices**</mark> in the self-attention module.

Now, let's focus on the <mark style="color:yellow;">**dimensions of the weight matrices**</mark>:

* <mark style="color:blue;">**Query matrix**</mark> $$(Wq)$$: $$(dmodel, d\_q)$$
* <mark style="color:blue;">**Key matrix**</mark> $$(Wk)$$: $$(dmodel, d\_k)$$
* <mark style="color:blue;">**Value matrix**</mark> $$(Wv)$$: $$(dmodel, d\_v)$$
* <mark style="color:blue;">**Output projection matrix**</mark> $$(Wo)$$: $$(dmodel, dmodel)$$

<mark style="color:blue;">**Weight matrices**</mark> enable the Transformer model to learn and capture complex relationships and dependencies within the input sequence. The learned weights in these matrices are crucial for the model's ability to attend to relevant information and generate meaningful representations.

In summary, the self-attention mechanism in the Transformer architecture relies on four key weight matrices: query, key, value, and output projection matrices.&#x20;

These matrices linearly project the input into different subspaces, enabling the model to capture dependencies and learn meaningful representations.  The dimensions of these matrices are typically determined by the model's hyperparameters, such as the dimensionality of the model and the number of attention heads.

Understanding the roles and dimensions of these weight matrices is crucial for grasping the concepts behind LoRA and the Intrinsic Rank Hypothesis, which we will explore in the next section.&#x20;

### <mark style="color:purple;">How does LoRA work?</mark>

LoRA (Low-Rank Adaptation) introduces a *<mark style="color:yellow;">**modification to the weight matrices**</mark>* to efficiently adapt the pre-trained model to downstream tasks.&#x20;

Let's see how LoRA gets involved in the process and influences the weights.

Recall the <mark style="color:blue;">**self-attention mechanism**</mark> has four <mark style="color:blue;">**weight matrices**</mark>:

* <mark style="color:blue;">**Query matrix**</mark> $$(Wq)$$
* <mark style="color:blue;">**Key matrix**</mark> $$(Wk)$$
* <mark style="color:blue;">**Value matrix**</mark> $$(Wv)$$
* <mark style="color:blue;">**Output projection matrix**</mark> $$(Wo)$$

These matrices are typically learned during the pre-training phase and have <mark style="color:blue;">**full rank**</mark>.

&#x20;In linear algebra, a *<mark style="color:yellow;">matrix is said to have</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">**full rank**</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">if its rank is equal to the smaller of its number of rows or columns</mark>*.&#x20;

In the context of neural networks, this means that the <mark style="color:blue;">**weight matrices**</mark> in dense layers are typically not low-rank: they cannot be represented exactly as the product of two much smaller matrices without losing information.

LoRA <mark style="color:yellow;">**modifies the weight matrices**</mark> by introducing a low-rank decomposition of the weight updates.&#x20;

Instead of directly updating the pre-trained weight matrices, LoRA represents the weight updates *<mark style="color:yellow;">**using two smaller matrices**</mark>*, $$A$$ and $$B$$, such that:

$$Wupdated = Wpretrained + ∆W$$

$$∆W = BA$$

where:

* $$Wpretrained$$ is the <mark style="color:blue;">**original pre-trained weight matrix**</mark> ($$Wq, Wk, Wv, Wo$$)
* $$∆W$$ is the <mark style="color:blue;">**weight update matrix**</mark>
* $$B$$ is a <mark style="color:blue;">**matrix**</mark> of size $$(dmodel, r$$), where $$r$$ is the rank of the decomposition
* $$A$$ is a <mark style="color:blue;">**matrix**</mark> of size $$(r, dmodel)$$

The key idea behind LoRA is to use a low-rank decomposition of the weight updates.  By representing the weight updates as the product of two smaller matrices, A and B, LoRA allows for a more compact and efficient representation.&#x20;

The <mark style="color:blue;">**rank (r)**</mark> of the decomposition determines the size of these matrices and controls the expressiveness of the adaptation. &#x20;

A lower rank results in fewer trainable parameters, making the adaptation more memory-efficient, while a higher rank allows for more flexibility in adapting the weights.
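
A quick NumPy sketch of the decomposition (sizes are illustrative): the product B @ A is a full-sized update matrix, yet its rank is bounded by r:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r = 64, 4

B = rng.normal(size=(d_model, r))   # (d_model, r)
A = rng.normal(size=(r, d_model))   # (r, d_model)
delta_W = B @ A                     # full (d_model, d_model) update...

assert delta_W.shape == (d_model, d_model)
assert np.linalg.matrix_rank(delta_W) <= r   # ...but rank at most r
```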

### <mark style="color:purple;">During the</mark> <mark style="color:purple;"></mark><mark style="color:purple;">**fine-tuning process**</mark> <mark style="color:purple;"></mark><mark style="color:purple;">with LoRA</mark>

The method involves freezing the original model weights and <mark style="color:yellow;">**adjusting only two smaller matrices**</mark>, $$A$$ <mark style="color:blue;">**and**</mark> $$B$$.

1. The <mark style="color:blue;">**pre-trained weight matrices**</mark> $$(Wpretrained)$$ <mark style="color:yellow;">**remain frozen**</mark> and do not receive gradient updates.
2. The <mark style="color:blue;">**matrices**</mark> $$A$$ and $$B$$ *<mark style="color:yellow;">**are the only trainable parameters**</mark>*. Matrix $$A$$ is initialised with a random Gaussian distribution, while matrix $$B$$ is initialised with zeros.

<details>

<summary><mark style="color:green;"><strong>Gaussian initialisation</strong></mark></summary>

In the LoRA method, matrix A is initialised with a random Gaussian distribution, while matrix B is initialised with zeros.&#x20;

The reason for using Gaussian initialisation for matrix A is to *<mark style="color:yellow;">**introduce randomness and break symmetry**</mark>* in the initial values of the weights.

When the weights of a neural network are initialised to the same value (e.g., all zeros), the network *<mark style="color:yellow;">**may struggle to learn meaningful patterns because all the neurons behave identically**</mark>*.

By initialising the weights with random values drawn from a Gaussian distribution, we ensure that the neurons start with different initial activations, allowing them to learn diverse features during training.

The choice of Gaussian initialisation is based on the principle of <mark style="color:blue;">**"symmetry breaking"**</mark> and the idea that the weights should be initialised with <mark style="color:blue;">**small random values**</mark> to facilitate gradient flow and prevent vanishing or exploding gradients. Gaussian initialisation has been shown to work well in practice and is commonly used in deep learning.

In the context of LoRA, initialising matrix $$A$$ with a Gaussian distribution ensures that the initial weight update matrix $$∆W$$ (which is the product of $$A$$ and $$B$$) has random, small values. &#x20;

This allows the model to *<mark style="color:yellow;">**gradually adapt the pre-trained weights to the downstream task during fine-tuning**</mark>*, starting from a point of random perturbation.

By initialising matrix $$B$$ with zeros, the initial weight update matrix $$∆W$$ is effectively zero, meaning that the model starts with the original pre-trained weights.&#x20;

As training progresses, the values of $$A$$ and $$B$$ are updated based on the gradients, allowing the model to learn the necessary adaptations for the specific task.

The combination of Gaussian initialisation for matrix $$A$$ and zero initialisation for matrix $$B$$ in LoRA ensures a balanced starting point for fine-tuning, facilitating the learning of task-specific adaptations while leveraging the knowledge captured in the pre-trained weights.

</details>

3. In the <mark style="color:blue;">**forward pass**</mark>, the input is multiplied with both the <mark style="color:blue;">**pre-trained weight matrix**</mark> $$(Wpretrained)$$ and the <mark style="color:blue;">**LoRA weight update matrix**</mark> $$(∆W = BA)$$. The results are then <mark style="color:blue;">**summed element-wise**</mark> to obtain the updated output.
4. During <mark style="color:blue;">**backpropagation**</mark>, gradients are computed only for the $$A$$ and $$B$$ <mark style="color:blue;">**matrices**</mark>, while the <mark style="color:blue;">**pre-trained weight matrices**</mark> remain unchanged.
5. The optimisation process updates the $$A$$ and $$B$$ <mark style="color:blue;">**matrices**</mark> based on the <mark style="color:blue;">**gradients**</mark>, allowing the model to adapt to the downstream task.
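
The procedure above can be sketched as a tiny NumPy layer (names and the initialisation scale are illustrative; real implementations use an autodiff framework):

```python
import numpy as np

class LoRALinear:
    """Frozen pre-trained weight plus a trainable low-rank update (a sketch)."""

    def __init__(self, d_model, r, rng):
        self.W = rng.normal(size=(d_model, d_model))   # frozen, never updated
        self.A = rng.normal(size=(r, d_model)) * 0.01  # Gaussian init
        self.B = np.zeros((d_model, r))                # zero init -> delta_W = 0

    def forward(self, x):
        # output = x @ W_pretrained + x @ (B @ A)
        return x @ self.W + x @ (self.B @ self.A)

rng = np.random.default_rng(0)
layer = LoRALinear(d_model=16, r=2, rng=rng)
x = rng.normal(size=(3, 16))

# with B = 0 the layer starts out identical to the frozen model
assert np.allclose(layer.forward(x), x @ layer.W)
```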

<details>

<summary><mark style="color:green;">Forward Pass and Backpropagation explanation</mark></summary>

<mark style="color:blue;">**Forward Pass**</mark>

During the forward pass with LoRA, the input is multiplied by both the pre-trained weight matrix $$Wpretrained$$ and the weight update matrix $$∆W$$. The updated output is computed as follows:

$$output = input × Wpretrained + input × ∆W$$

$$∆W = B × A$$

The pre-trained weight matrix $$Wpretrained$$ remains frozen, while the matrices $$A$$ and $$B$$ are learned during fine-tuning.&#x20;

The output of the self-attention layer is obtained by summing the results of the matrix multiplications.

<mark style="color:blue;">**Backpropagation**</mark>

During backpropagation, the *<mark style="color:yellow;">**gradients are computed with respect to the input and the trainable parameters**</mark>*.&#x20;

In LoRA, only the matrices $$A$$ and $$B$$ are updated based on the gradients, while the pre-trained weight matrix $$Wpretrained$$ remains unchanged.

The gradients of the loss with respect to $$A$$ and $$B$$  are computed using the chain rule:

$$∂loss / ∂A = B^T × (∂loss / ∂∆W)$$

$$∂loss / ∂B = (∂loss / ∂∆W) × A^T$$

The optimiser then uses these gradients to update the values of $$A$$ and $$B$$.
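
A finite-difference sketch (NumPy, illustrative) confirms the chain-rule gradients of $$∆W = B × A$$, with G standing in for $$∂loss / ∂∆W$$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))
G = rng.normal(size=(d, d))        # plays the role of dloss/d(deltaW)

def loss(A_, B_):
    return np.sum(G * (B_ @ A_))   # a loss whose gradient w.r.t. deltaW is G

dA = B.T @ G                       # dloss/dA = B^T (dloss/d deltaW)
dB = G @ A.T                       # dloss/dB = (dloss/d deltaW) A^T

# perturb one element of A and of B; compare to the analytic gradients
eps = 1e-6
A_pert = A.copy(); A_pert[0, 0] += eps
assert abs((loss(A_pert, B) - loss(A, B)) / eps - dA[0, 0]) < 1e-4
B_pert = B.copy(); B_pert[0, 0] += eps
assert abs((loss(A, B_pert) - loss(A, B)) / eps - dB[0, 0]) < 1e-4
```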

</details>

By representing the weight updates using a <mark style="color:blue;">**low-rank decomposition**</mark> $$(∆W = BA)$$, LoRA significantly reduces the number of trainable parameters.&#x20;

The <mark style="color:blue;">**rank r**</mark> <mark style="color:yellow;">**determines the**</mark><mark style="color:yellow;">**&#x20;**</mark>*<mark style="color:yellow;">**size**</mark>*<mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**of the matrices**</mark> $$A$$ and $$B$$ and *<mark style="color:yellow;">**controls the expressiveness**</mark>* of the adaptation. &#x20;

A smaller <mark style="color:blue;">**rank r**</mark> results in fewer trainable parameters and more efficient adaptation, while a larger <mark style="color:blue;">**rank r**</mark> allows for more flexibility in adapting the weights.

### <mark style="color:purple;">What are the low rank matrices?</mark>

In the Low-Rank Adaptation (LoRA) method proposed in this paper, the terms $$A$$ and $$B$$ refer to the <mark style="color:blue;">**low-rank matrices**</mark> used to approximate the <mark style="color:blue;">**weight update matrix**</mark> $$∆W$$ during adaptation.&#x20;

As discussed, the <mark style="color:blue;">**weight matrix**</mark> $$W$$ being targeted is *<mark style="color:yellow;">**part of the Transformer architecture**</mark>*, specifically the weight matrices in the self-attention module.

### <mark style="color:purple;">Matrices ( A ) and ( B )</mark> <a href="#id-230a" id="id-230a"></a>

As highlighted, the authors of LoRA hypothesised that during fine-tuning, the updates to the weights $$(∆W)$$ *<mark style="color:yellow;">**have a low "intrinsic rank"**</mark>*, meaning they can be well-approximated by a <mark style="color:blue;">**low-rank matrix**</mark>.

This means that significant changes to the neural network can be captured <mark style="color:yellow;">**using a lower-dimensional representation**</mark>.&#x20;

Essentially, the idea is that <mark style="color:yellow;">**not all elements of**</mark> $$ΔW$$ <mark style="color:yellow;">**are equally important**</mark>; instead, a smaller subset of these changes can effectively encapsulate the necessary adjustments.

Building on this hypothesis, LoRA proposes representing $$ΔW$$ as the <mark style="color:blue;">**product**</mark> of two smaller matrices, $$A$$ and $$B$$, *<mark style="color:yellow;">**with a lower rank**</mark>*.&#x20;

$$ΔW$$ denotes the *<mark style="color:yellow;">**change to the pre-trained weights**</mark>* learned during fine-tuning.

The <mark style="color:blue;">**updated weight matrix**</mark> $$( W’ )$$ thus becomes:

$$W' = W + BA$$

In this equation, $$W$$ remains frozen (i.e., <mark style="color:yellow;">**it is not updated during training**</mark>).&#x20;

The <mark style="color:blue;">**matrices**</mark> $$B$$ and $$A$$ are of lower dimensionality, with their <mark style="color:blue;">**product**</mark> $$BA$$ representing a low-rank approximation of $$ΔW$$.

### <mark style="color:purple;">Impact of Lower Rank on Trainable Parameters</mark> <a href="#id-1131" id="id-1131"></a>

By choosing <mark style="color:blue;">**matrices**</mark> $$( A )$$ and $$( B )$$ to have a <mark style="color:blue;">**lower rank**</mark> $$( r )$$, the number of trainable parameters is significantly reduced.&#x20;

For example, if $$W$$ is a $$d \times d$$ <mark style="color:blue;">**matrix**</mark>, traditionally, updating $$W$$ would involve $$d^2$$ <mark style="color:blue;">**parameters**</mark>.&#x20;

However, with $$B$$ and $$A$$ of sizes $$d \times r$$ and $$r \times d$$ respectively, the total number of <mark style="color:blue;">**parameters**</mark> reduces to $$2dr$$, which is much smaller when $$r \ll d$$.
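
The arithmetic, sketched with illustrative sizes:

```python
d, r = 4096, 8                 # illustrative model width and LoRA rank

full_update = d * d            # parameters in a full d x d update
lora_update = d * r + r * d    # parameters in B (d x r) plus A (r x d)

print(full_update)                 # 16777216
print(lora_update)                 # 65536
print(full_update // lora_update)  # 256x fewer trainable parameters
```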

The reduction in the number of trainable parameters, as achieved through the Low-Rank Adaptation (LoRA) method, offers several significant benefits, particularly when fine-tuning large-scale neural networks.

### <mark style="color:purple;">What does 'rank' mean?</mark>

The term "rank" in the context of LoRA refers to the <mark style="color:blue;">**rank of the weight update matrix**</mark> $$∆W$$, which is approximated by the <mark style="color:blue;">**product**</mark> of two smaller <mark style="color:blue;">**matrices**</mark> $$A$$ and $$B$$.&#x20;

The idea here is to exploit the fact that matrices can contain “duplicate information” in the form of linear dependence.  &#x20;

This means we can use factorisation to <mark style="color:yellow;">**represent a large matrix in terms of two smaller matrices**</mark>.

Just as a large number can be represented as the product of two smaller numbers, a matrix can be thought of as the product of two smaller matrices.

The <mark style="color:blue;">**rank**</mark> of a <mark style="color:blue;">**matrix**</mark> is the maximum *<mark style="color:yellow;">**number of linearly independent rows or columns in the matrix**</mark>*.

In other words, it is the dimension of the vector space spanned by the rows or columns of the matrix.

<mark style="color:green;">**Example:**</mark> Consider the following <mark style="color:blue;">**matrix**</mark> M:

```
[1 2 3] 
[2 4 6] 
[3 6 9]
```

The rows of this matrix are <mark style="color:blue;">**not linearly independent**</mark>: the second row is $$2 \times$$ the first row, and the third row is $$3 \times$$ the first row. We can express any row as a multiple of the first.

Similarly, the <mark style="color:blue;">**columns are not linearly independent**</mark>: the second and third columns are $$2 \times$$ and $$3 \times$$ the first column.

To find the rank of the matrix, we can use <mark style="color:blue;">**Gaussian elimination**</mark> to convert the matrix into <mark style="color:blue;">**row echelon form**</mark>.&#x20;

<mark style="color:blue;">**Row echelon form**</mark> is a type of matrix in which:

1. All nonzero rows are above any rows of all zeros.
2. The leading entry of each nonzero row after the first occurs to the right of the leading entry of the previous row.
3. The leading entry in any nonzero row is 1.
4. All entries in the column below a leading 1 are zeros.

Using <mark style="color:blue;">**Gaussian elimination**</mark>, a matrix is transformed into row echelon form. This helps determine properties such as the rank and solve linear systems.

<mark style="color:blue;">**Gaussian elimination**</mark> is a method used to <mark style="color:yellow;">**solve systems of linear equations**</mark>, find the <mark style="color:blue;">**rank r**</mark>, and calculate the determinant of a matrix. &#x20;

The process involves three main steps:

1. <mark style="color:purple;">**Forward Elimination:**</mark> Transform the matrix into an upper triangular form.
2. <mark style="color:purple;">**Pivoting:**</mark> Swap rows to position the highest absolute value as the pivot element to avoid division by zero and improve numerical stability.
3. <mark style="color:purple;">**Back Substitution:**</mark> Solve for the variables starting from the last row upwards.

By converting the <mark style="color:blue;">**matrix**</mark> into row <mark style="color:blue;">**echelon form**</mark>, Gaussian elimination <mark style="color:yellow;">**simplifies the system**</mark>, making it easier to understand its properties and solutions.

After performing Gaussian elimination on <mark style="color:blue;">**matrix M**</mark> (subtracting $$2 \times$$ row 1 from row 2, and $$3 \times$$ row 1 from row 3), we get:

```
[1 2 3]
[0 0 0]
[0 0 0]
```

The number of non-zero rows in the <mark style="color:blue;">**row echelon form**</mark> is the rank of the <mark style="color:blue;">**matrix**</mark>.

In this case, the rank is 1, which is the maximum number of linearly independent rows or columns in the <mark style="color:blue;">**matrix**</mark> M.

In the context of LoRA, the <mark style="color:blue;">**rank r**</mark> determines the dimensionality of the subspace in which the <mark style="color:blue;">**weight update matrix**</mark> $$∆W$$ is approximated.&#x20;

By choosing a lower <mark style="color:blue;">**rank r**</mark>, we enforce a low-dimensional structure on the weight update matrix, reducing the number of trainable parameters while still capturing the most important aspects of the adaptation.
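
One common way to see this low-dimensional structure is a truncated SVD, which gives the best rank-r approximation of a matrix (NumPy sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))            # a full-rank matrix

U, s, Vt = np.linalg.svd(W)
r = 8
W_r = (U[:, :r] * s[:r]) @ Vt[:r]        # keep only the top r directions

assert np.linalg.matrix_rank(W_r) == r   # the approximation has rank r
```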

So to sum up, in LoRA, the <mark style="color:blue;">**rank r**</mark> is the hyperparameter that determines the <mark style="color:blue;">**size of the matrices A and B**</mark>.&#x20;

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2F3tO0ng9bByY7MqN1BLAk%2Fimage.png?alt=media&#x26;token=947809fd-2cb2-489a-ae00-8ae57f5f1938" alt=""><figcaption><p>A conceptual diagram of LoRA with an r value equal to 1 and 2. In both examples the decomposed A and B matrices result in the same sized change matrix, but r=2 is able to encode more linearly independent information into the change matrix, due to having more information in the A and B matrices.  Source: <a href="https://medium.com/@danielwarfield1?source=post_page-----e944a6bff46b--------------------------------">Daniel Warfield</a></p></figcaption></figure>

<mark style="color:purple;">**Specifically:**</mark>

* Matrix $$A$$ has shape $$(r, dmodel)$$, where r is the <mark style="color:blue;">**rank of the decomposition**</mark> and $$dmodel$$ is the <mark style="color:blue;">**dimension of the model**</mark>.
* Matrix $$B$$ has <mark style="color:blue;">**dimensions**</mark> $$(dmodel, r)$$

So, the updated equation is:

$$Wupdated = Wpretrained + ∆W$$&#x20;

$$∆W = BA$$

<mark style="color:purple;">**where:**</mark>

* $$Wpretrained$$ is the original <mark style="color:blue;">**pre-trained weight matrix**</mark> $$(dmodel, dmodel)$$
* $$∆W$$is the <mark style="color:blue;">**weight update matrix**</mark> $$(dmodel, dmodel)$$
* $$B$$ is a <mark style="color:blue;">**matrix of size**</mark> $$(dmodel, r)$$
* $$A$$ is a <mark style="color:blue;">**matrix of size**</mark> $$(r, dmodel)$$

The <mark style="color:blue;">**product**</mark> of $$B$$ and $$A$$ results in a <mark style="color:blue;">**matrix**</mark> $$∆W$$ of <mark style="color:blue;">**shape**</mark> $$(dmodel, dmodel)$$, which has <mark style="color:blue;">**rank at most r**</mark>.&#x20;

By choosing a smaller value for r, we enforce a low-rank structure on the <mark style="color:blue;">**weight update matrix**</mark> $$∆W$$.

#### <mark style="color:green;">**Expressiveness and Rank**</mark>

The <mark style="color:blue;">**rank r**</mark> of the <mark style="color:blue;">**weight update matrix**</mark> ∆W controls the <mark style="color:blue;">**expressiveness**</mark> of the adaptation.&#x20;

A *<mark style="color:yellow;">**higher rank allows for more flexibility in adapting the weights**</mark>*, as it can capture more complex patterns and transformations.  However, increasing the rank also increases the number of trainable parameters.

On the other hand, a *<mark style="color:yellow;">**lower rank restricts the expressiveness of the adaptation**</mark>* but results in fewer trainable parameters.&#x20;

This is because the  <mark style="color:blue;">**matrices**</mark> $$( A )$$ and $$( B )$$ have fewer elements when <mark style="color:blue;">**r**</mark> is smaller.

A low-rank approximation of the <mark style="color:blue;">**weight update matrix**</mark> can still capture the most important aspects of the adaptation while being more parameter-efficient.

The choice of <mark style="color:blue;">**rank r**</mark> depends on the complexity of the downstream task and the available computational resources. It is a trade-off between expressiveness and efficiency.

The consensus is that when the fine-tuning data is similar to the data used in pre-training, a low <mark style="color:blue;">**rank r**</mark> value is probably sufficient. When fine-tuning on very new tasks, which might require substantial logical changes within the model, a higher <mark style="color:blue;">**rank r**</mark> value may work better.

### <mark style="color:purple;">**Applying LoRA to Transformer Self Attention Weights**</mark>

* LoRA <mark style="color:yellow;">**can be applied to any or all of the self-attention weight matrices**</mark> $$(Wq, Wk, Wv, Wo)$$ in each Transformer layer.
* The paper primarily focuses on adapting only the <mark style="color:blue;">**query**</mark> $$(Wq)$$ and <mark style="color:blue;">**value**</mark> $$(Wv)$$ matrices because they play the most critical role in capturing and transforming the input representations.&#x20;
* During the <mark style="color:blue;">**forward pass**</mark>, the <mark style="color:blue;">**adapted weight matrix**</mark> $$W'$$ is computed as $$W' = W + BA$$, where $$W$$ is the <mark style="color:blue;">**pre-trained weight matrix**</mark> and $$BA$$ is the low-rank adaptation term.
* This allows for efficient adaptation *<mark style="color:yellow;">**without introducing additional inference latency**</mark>*, as the low-rank matrices can be merged with the pre-trained weights during deployment.
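
A small sketch of the merge step (NumPy, illustrative sizes): folding B @ A into W once at deployment reproduces the two-path forward pass exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2
W = rng.normal(size=(d, d))        # frozen pre-trained weight
B = rng.normal(size=(d, r))        # trained LoRA factors
A = rng.normal(size=(r, d))
x = rng.normal(size=(3, d))

y_train = x @ W + x @ (B @ A)      # training-time: two paths

W_merged = W + B @ A               # deployment: merge once...
y_deploy = x @ W_merged            # ...then a single matmul, no extra latency

assert np.allclose(y_train, y_deploy)
```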

The key idea behind LoRA is to <mark style="color:yellow;">**exploit the low-rank structure of the adaptation matrix**</mark> $$∆W$$.&#x20;

By approximating $$∆W$$ with <mark style="color:blue;">**low-rank matrices**</mark> $$A$$ and $$B$$, LoRA significantly reduces the number of trainable parameters while still allowing for effective adaptation to downstream tasks.&#x20;

This makes it suitable for adapting large pre-trained language models, where standard fine-tuning would be prohibitively expensive in terms of storage and computation.

### <mark style="color:purple;">The problem with full fine tuning</mark>

To remind us of the problem LoRA is solving, below is a brief description of the language modeling problem for full fine-tuning: in particular, the maximisation of conditional probabilities given a task-specific prompt.

* $$PΦ(y|x)$$ represents a <mark style="color:blue;">**pre-trained autoregressive language model**</mark>, where Φ denotes the <mark style="color:blue;">**model parameters**</mark>.
* $$Z = \{(x_i, y_i)\}_{i=1,\dots,N}$$ is a <mark style="color:blue;">**dataset of context-target pairs**</mark> for a downstream task, where $$x_i$$ is the <mark style="color:blue;">**context**</mark> and $$y_i$$ is the <mark style="color:blue;">**target sequence**</mark>.
* $$Φ0$$represents the initial <mark style="color:blue;">**pre-trained weights**</mark> of the model.
* $$∆Φ$$ represents the <mark style="color:blue;">**task-specific parameter**</mark> increment during fine-tuning.
* $$Θ$$is a <mark style="color:blue;">**smaller set of parameters**</mark> used to encode $$∆Φ$$.

### <mark style="color:green;">**Language Modeling Objective**</mark>

#### The goal is to adapt a pre-trained language model to downstream conditional text generation tasks.   &#x20;

#### The model learns to <mark style="color:yellow;">**maximise the conditional probability of the target sequence**</mark> given the context.

During <mark style="color:blue;">**full fine-tuning**</mark>, the objective is to maximise the sum of log probabilities of each <mark style="color:blue;">**token**</mark> $$y_t$$ in the <mark style="color:blue;">**target sequence**</mark> $$y$$, conditioned on the <mark style="color:blue;">**context**</mark> $$x$$ and the <mark style="color:blue;">**previous tokens**</mark> $$y_{<t}$$.&#x20;

This is achieved by updating the <mark style="color:blue;">**model parameters**</mark> from $$Φ_0$$ to $$Φ_0 + ∆Φ$$ through [<mark style="color:blue;">**gradient descent**</mark>](#user-content-fn-2)[^2]:

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FxnAX2YT4vdG6oiTbYjAs%2Fimage.png?alt=media&#x26;token=a6419d7c-aa02-4339-98ac-a3ab1381aaef" alt=""><figcaption></figcaption></figure>

The notation $$(x, y) ∈ Z$$ indicates that the <mark style="color:yellow;">**summation is performed over all context-target pairs**</mark> in the dataset $$Z$$. $$|y|$$ denotes the <mark style="color:blue;">**length of the target sequence**</mark> $$y$$.
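Written out in this notation, the full fine-tuning objective shown in the figure above is:

```latex
\max_{\Phi} \; \sum_{(x,y) \in Z} \; \sum_{t=1}^{|y|} \log P_{\Phi}\big(y_t \mid x, y_{<t}\big)
```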

#### <mark style="color:green;">**Parameter-Efficient Approach**</mark>

The main drawback of full fine-tuning is that for each downstream task, a <mark style="color:blue;">**separate set of parameters**</mark> ∆Φ is learned, which <mark style="color:yellow;">**has the**</mark><mark style="color:yellow;">**&#x20;**</mark>*<mark style="color:yellow;">**same dimension**</mark>*<mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**as the pre-trained weights**</mark> Φ0.&#x20;

This can be challenging to store and deploy, especially for large models.

To address this, the authors propose a parameter-efficient approach where the <mark style="color:yellow;">**task-specific parameter increment**</mark> ∆Φ is encoded by a much <mark style="color:yellow;">**smaller set of parameters**</mark> Θ, such that |Θ| << |Φ0|.

The objective becomes:

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2F3Z0oplOtfItGJbCrk4rG%2Fimage.png?alt=media&#x26;token=133491c6-9d25-469a-b14d-65d191f445a8" alt=""><figcaption></figcaption></figure>

Here, ∆Φ is a function of Θ, denoted as ∆Φ(Θ).&#x20;

The goal is to find the optimal Θ that maximises the <mark style="color:yellow;">**conditional language modeling objective**</mark>.
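In the same notation, the parameter-efficient objective shown in the figure above is:

```latex
\max_{\Theta} \; \sum_{(x,y) \in Z} \; \sum_{t=1}^{|y|} \log p_{\Phi_0 + \Delta\Phi(\Theta)}\big(y_t \mid x, y_{<t}\big)
```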

### <mark style="color:green;">**Low-Rank Representation**</mark>

The authors propose to use a <mark style="color:blue;">**low-rank representation**</mark> to encode $$∆Φ$$, which is both compute- and memory-efficient.&#x20;

This means that $$∆Φ$$ is represented as a <mark style="color:blue;">**product of smaller matrices**</mark>, reducing the number of parameters needed to store and update during fine-tuning.

The key idea is to <mark style="color:yellow;">**significantly reduce the number of trainable parameters**</mark> $$|Θ|$$ compared to the size of the pre-trained weights $$|Φ0|$$.

For example, when using the GPT-3 175-billion-parameter model as the pre-trained model, the <mark style="color:yellow;">**number of trainable parameters**</mark> $$|Θ|$$ can be as *<mark style="color:yellow;">**small as 0.01%**</mark>* of $$|Φ0|$$, greatly reducing the storage and computational requirements for fine-tuning on downstream tasks.
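A back-of-the-envelope check of that 0.01% figure, using GPT-3's approximate dimensions (96 layers, hidden size 12,288) and assuming LoRA with rank r = 4 applied to the query and value matrices in every layer:

```python
d_model = 12288        # GPT-3 hidden size
n_layers = 96          # GPT-3 transformer layers
r = 4                  # LoRA rank

# Each adapted d x d matrix adds A (r x d) and B (d x r): 2 * d * r parameters.
params_per_matrix = 2 * d_model * r

# Adapting two matrices (Wq and Wv) in every layer:
lora_params = n_layers * 2 * params_per_matrix

full_params = 175e9    # GPT-3 total parameter count

print(lora_params)                 # 18874368 trainable parameters (~18.9M)
print(lora_params / full_params)   # ~0.0001, i.e. roughly 0.01%
```

This also matches the roughly 18-million-parameter budget the authors used in their GPT-3 experiments.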

### <mark style="color:purple;">Issues with existing solutions</mark>

Existing <mark style="color:blue;">**Parameter-Efficient Fine-Tuning (PEFT)**</mark> solutions for efficient model adaptation in transfer learning, such as <mark style="color:blue;">**adding adapter layers**</mark> or <mark style="color:blue;">**optimising input layer activations,**</mark> have limitations, especially in large-scale and latency-sensitive production scenarios.&#x20;

<mark style="color:purple;">**We discuss each below:**</mark>

### <mark style="color:green;">Adapter Layers</mark>

Adapter layers, as proposed by Houlsby et al. (2019) and Lin et al. (2020), are <mark style="color:yellow;">**additional layers inserted into the Transformer architecture**</mark> to enable parameter-efficient fine-tuning.&#x20;

While adapters have fewer parameters compared to the original model, they <mark style="color:yellow;">**introduce extra computation**</mark> that must be processed sequentially, leading to increased inference latency.

The latency issue becomes more prominent in online inference settings where the batch size is small (for example a batch size of 1).&#x20;

Even with a small bottleneck dimension in the adapter layers, the added latency can be significant (e.g., 20-30% slower for GPT-2 medium on a single GPU).  This problem is amplified when using model parallelism techniques, as the additional depth requires more synchronous GPU operations.

### <mark style="color:green;">**Optimising Input Layer Activations (Prompt Tuning)**</mark>

Another PEFT approach is prefix tuning (Li & Liang, 2021), which *<mark style="color:yellow;">**directly optimises a portion of the input layer activations**</mark>* (the prompt) while keeping the pre-trained model unchanged.&#x20;

However, this method faces optimisation challenges and exhibits <mark style="color:blue;">**non-monotonic performance**</mark> changes with respect to the number of trainable parameters.

#### <mark style="color:blue;">Non-Monotonic Performance Changes in Prompt Tuning</mark>

<mark style="color:blue;">**Non-monotonic performance**</mark> changes refer to <mark style="color:yellow;">**fluctuations in model performance**</mark> that do not consistently improve or degrade as the number of trainable parameters increases.

In the context of prompt tuning, this means that *<mark style="color:yellow;">**increasing the number of trainable parameters does not guarantee a corresponding increase in model performance**</mark>*.&#x20;

Instead, performance may improve up to a point, then degrade, and potentially improve again, creating a non-linear relationship. These irregular performance patterns can complicate the optimisation process and make it challenging to determine the optimal parameter configuration for prompt tuning.

### <mark style="color:purple;">LoRA in Practice</mark>

#### <mark style="color:green;">**What subset of weight matrices should be adapted for maximum downstream performance?**</mark>

The authors experimented with applying LoRA to different subsets of the <mark style="color:blue;">**self-attention weight matrices**</mark> when training GPT-3, with a budget of 18 million trainable parameters (compared with GPT-3's 175 billion parameters!). The <mark style="color:blue;">**weight matrices**</mark> are as follows:&#x20;

* $$Wq$$ (query)
* $$Wk$$ (key)
* $$Wv$$ (value)
* $$Wo$$ (output)

They found that adapting both the <mark style="color:blue;">**query weights**</mark> $$Wq$$ and <mark style="color:blue;">**value weights**</mark> $$Wv$$ yielded the best performance on downstream tasks like WikiSQL[^3] and MultiNLI[^4].&#x20;

Adapting $$Wq$$, $$Wk$$, $$Wv$$, and $$Wo$$ together also performed well, but adapting only $$Wq$$ or $$Wk$$ resulted in <mark style="color:yellow;">**significantly lower performance**</mark>.

This suggests that even with a <mark style="color:blue;">**low rank**</mark> (e.g., r=4), adapting multiple <mark style="color:blue;">**weight matrices**</mark> captures more useful information than adapting a <mark style="color:blue;">**single weight matrix**</mark> with a <mark style="color:blue;">**higher rank.**</mark>

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2F5K6G2NIUzazPQ2Ecfc5g%2Fimage.png?alt=media&#x26;token=18bdc31c-a6a7-4770-8013-630c7d8ab2f1" alt=""><figcaption></figcaption></figure>
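Under a fixed parameter budget, adapting more matrices means using a lower rank for each. The arithmetic for a single layer makes the trade-off concrete (hidden size is GPT-3-scale for illustration):

```python
d = 12288  # hidden size, GPT-3-scale for illustration

def lora_params(n_matrices, r, d=d):
    """Trainable parameters when adapting n square d x d matrices at rank r."""
    return n_matrices * 2 * d * r

# The same budget buys one matrix at r=8, two at r=4, or four at r=2 --
# and the experiments favour spreading the budget across more matrices.
assert lora_params(1, 8) == lora_params(2, 4) == lora_params(4, 2)
```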

### <mark style="color:green;">**What is the optimal rank for the adaptation matrix ∆W?**</mark>

The authors investigated the effect of the LoRA <mark style="color:blue;">**rank r**</mark> on downstream performance.

Surprisingly, they found that a <mark style="color:yellow;">**rank as low as r=1 was sufficient**</mark> for adapting both $$Wq$$ and $$Wv$$ on the datasets they tested, while adapting $$Wq$$ alone required a larger <mark style="color:blue;">**rank r**</mark>.

They further analysed the subspace similarity between the learned adaptation matrices with different ranks (r=8 and r=64) and found that the *<mark style="color:yellow;">**top singular vector directions overlapped significantly**</mark>*, while the other directions did not.&#x20;

This suggests that the additional directions learned with higher ranks might contain mostly random noise.

The authors conclude that the optimal <mark style="color:blue;">**adaptation matrix**</mark> ∆W can indeed have a very low intrinsic rank, although they note that this may not hold for every task or dataset.

### <mark style="color:green;">**Connection between ∆W and W**</mark>

To investigate the relationship between the <mark style="color:blue;">**adaptation matrix**</mark> $$∆W$$ and the <mark style="color:blue;">**pre-trained weight matrix**</mark> $$W$$, the authors projected $$W$$ onto the r-dimensional subspace of $$∆W$$ and compared the [<mark style="color:blue;">**Frobenius norms**</mark>](#user-content-fn-5)[^5].

They found that $$∆W$$ has a stronger correlation with $$W$$ <mark style="color:yellow;">**compared to a random matrix**</mark>, indicating that $$∆W$$ amplifies some features that are already present in $$W$$.&#x20;

However, instead of amplifying the top singular directions of  $$W$$, $$∆W$$ emphasises directions that are not as prominent in $$W$$.

The amplification factor is quite large (e.g., 21.5 for r=4 in the 48th layer of GPT-3).

This suggests that the low-rank adaptation matrix <mark style="color:yellow;">**amplifies important features for specific downstream tasks**</mark> that were *<mark style="color:yellow;">**learned but not emphasised in the general pre-training model**</mark>*.

### <mark style="color:green;">**Process for determining the optimal rank r for LoRA when fine-tuning**</mark>

1. Start with a low <mark style="color:blue;">**rank r**</mark> (e.g., r=1 or r=2) and fine-tune the model on the downstream task.
2. Gradually increase the <mark style="color:blue;">**rank r**</mark> (e.g., r=4, r=8) and compare the performance on a validation set.
3. If increasing the rank leads to significant improvements, continue increasing <mark style="color:blue;">**rank r**</mark> until the performance gains plateau or the computational cost becomes too high.
4. If the performance is already good with a low rank, try *<mark style="color:yellow;">**adapting additional weight matrices**</mark>* (e.g., $$Wq$$and $$Wv$$ together) with the same low rank.
5. Compare the performance and computational cost of different combinations of rank and adapted weight matrices to find the optimal configuration for the specific downstream task and resource constraints.
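The search in steps 1-3 can be sketched as a simple loop. Everything here is illustrative: `val_scores` holds hypothetical validation scores from separate fine-tuning runs at each rank, and `min_gain` is an assumed plateau threshold:

```python
def pick_rank(val_scores, min_gain=0.01):
    """Increase the rank until validation gains plateau (steps 1-3 above)."""
    ranks = sorted(val_scores)
    best = ranks[0]
    for r in ranks[1:]:
        if val_scores[r] - val_scores[best] >= min_gain:
            best = r
        else:
            break  # gains have plateaued; stop increasing the rank
    return best

# Hypothetical validation scores from four fine-tuning runs:
scores = {1: 0.80, 2: 0.85, 4: 0.855, 8: 0.856}
print(pick_rank(scores))  # 2 -- r=4 adds less than min_gain over r=2
```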

Keep in mind that the optimal rank may vary depending on the complexity of the downstream task, the size of the dataset, and the similarity between the pre-training and downstream domains.&#x20;

It's also important to consider the trade-off between performance and computational efficiency when choosing the rank.

### <mark style="color:purple;">Contents of A and B</mark>

The <mark style="color:blue;">**matrices**</mark> $$( A )$$ and $$( B )$$ contain learned parameters that are updated during the fine-tuning process. They are initialised randomly at the beginning of training:

* Matrix $$( A )$$ is initialised with a random Gaussian distribution.
* Matrix $$( B )$$ is initialised with zeros.

During training, the values in $$( A )$$ and $$( B )$$  are updated based on the gradients computed during <mark style="color:blue;">**backpropagation**</mark>.&#x20;

These matrices learn to adapt the <mark style="color:blue;">**pre-trained weights**</mark> to the specific downstream task by capturing the important patterns and transformations needed for the adaptation.

The content of $$( A )$$ and $$( B )$$  is learned through the optimisation process and depends on the specific task and dataset being fine-tuned on.&#x20;

The learned values in these matrices represent the low-rank adaptation that modifies the pre-trained weights to better suit the downstream task.
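The initialisation described above can be sketched in NumPy (shapes and scale are illustrative). Because $$( B )$$ starts at zero, the product $$BA$$ is zero, so fine-tuning begins from the unchanged pre-trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                             # illustrative dimension and rank

A = rng.normal(scale=0.02, size=(r, d))  # A: random Gaussian initialisation
B = np.zeros((d, r))                     # B: initialised to zeros

delta_W = B @ A                          # the low-rank update at the start of training

# At initialisation the update is exactly zero, so the adapted model
# initially behaves identically to the pre-trained one.
assert np.all(delta_W == 0)
```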

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FuTd17Rqiqma6z6jDRthh%2Fimage.png?alt=media&#x26;token=9496956f-a9b4-45a3-b75e-86941e35a9f7" alt=""><figcaption><p>Backpropagation is a fundamental algorithm used in training neural networks and other differentiable machine learning models. It is a method for efficiently calculating the gradients of the model's parameters with respect to the loss function. The goal of backpropagation is to update the model's parameters in a way that minimizes the difference between the predicted output and the desired output.</p></figcaption></figure>

### <mark style="color:purple;">Why LoRA is Better!</mark>

LoRA addresses the limitations of existing solutions by introducing a more efficient and effective approach to model adaptation:

<mark style="color:blue;">**No Inference Latency:**</mark> Unlike adapter layers, LoRA does not introduce additional depth to the model. The low-rank adaptation matrices can be <mark style="color:yellow;">**merged with the pre-trained weights after fine-tuning**</mark>, resulting in no extra inference latency compared to a fully fine-tuned model.

<mark style="color:blue;">**Compute and Memory Efficiency:**</mark> LoRA uses a low-rank representation to encode the task-specific parameter increments, significantly reducing the number of trainable parameters. This makes fine-tuning more compute- and memory-efficient, especially for large models.

<mark style="color:blue;">**Optimisation Stability:**</mark> Compared to prompt tuning, LoRA optimises the model parameters directly, which leads to more stable optimisation and monotonic performance improvements with increased model capacity.

<mark style="color:blue;">**Sequence Length Preservation:**</mark> LoRA does not require reserving a portion of the sequence length for adaptation, allowing the full sequence length to be used for downstream tasks, potentially leading to better performance.

<mark style="color:blue;">**Flexibility and Composability:**</mark> LoRA is agnostic to the training objective and can be easily integrated with existing models and architectures. It is also composable with other adaptation techniques, such as prefix tuning, offering further flexibility.

<mark style="color:blue;">**Enhanced Compatibility**</mark><mark style="color:blue;">:</mark> Works well alongside other fine-tuning techniques like adapters and prefix tuning.

## <mark style="color:purple;">Conclusion</mark> <a href="#id-3d50" id="id-3d50"></a>

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that takes advantage of the intrinsically low rank of the weight matrices.&#x20;

Instead of updating the entire weight matrix during fine-tuning, LoRA decomposes the weight update matrix into two smaller matrices (A and B) with a lower <mark style="color:blue;">**rank r**</mark>.&#x20;

This significantly reduces the number of trainable parameters and memory requirements while maintaining performance comparable to full fine-tuning.

### <mark style="color:purple;">Key Insights</mark>

#### <mark style="color:blue;">**Swappable LoRA Modules**</mark>

One of the most significant advantages of LoRA is the ability to swap different fine-tuned LoRA modules for various tasks on a single base model.&#x20;

This allows for a more flexible and efficient deployment of models, as you can easily switch between tasks without needing separate fully fine-tuned models for each task.

#### <mark style="color:blue;">**Inference Time Swapping**</mark>

The swappable nature of LoRA modules can be used even at inference time.&#x20;

This means that customers can choose which task they want the model to perform on-the-fly, without requiring multiple models to be running simultaneously. This is a powerful feature that sets LoRA apart from other adaptation methods.
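A toy sketch of this idea: one frozen base matrix is shared, and the small per-task $$(A, B)$$ pairs are merged on the fly at inference time (task names, shapes, and random values are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4

W = rng.normal(size=(d, d))        # shared, frozen base weights

# One small (A, B) pair per task -- each a fraction of the size of W:
lora_modules = {
    "summarise": (rng.normal(size=(r, d)), rng.normal(size=(d, r))),
    "translate": (rng.normal(size=(r, d)), rng.normal(size=(d, r))),
}

def adapted_weights(task):
    A, B = lora_modules[task]
    return W + B @ A               # merge the chosen module on the fly

# Swapping tasks only touches the small factors, never the base model:
assert not np.allclose(adapted_weights("summarise"), adapted_weights("translate"))
```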

#### <mark style="color:blue;">**Potential for Further Optimisation**</mark>

While LoRA is often applied to the attention weights (specifically query and value matrices) in transformer models, the technique could potentially be *<mark style="color:yellow;">**applied to other weight matrices in the model**</mark>*.&#x20;

Exploring the application of LoRA to different components of the model architecture could lead to further optimisations and improvements.

#### <mark style="color:blue;">**Balancing Rank and Performance**</mark>

The *<mark style="color:yellow;">**rank of the low-rank matrices (A and B) in LoRA is a crucial hyperparameter**</mark>* that determines the trade-off between model performance and efficiency. While lower ranks lead to greater memory savings and faster training, it's essential to find the right balance for each specific task. Experimenting with different rank values and evaluating the results on a validation set can help determine the optimal configuration.

#### <mark style="color:blue;">**Implications for Model Accessibility**</mark>

By significantly reducing the memory requirements and training costs, LoRA makes fine-tuning large language models more accessible to a wider range of researchers and practitioners.&#x20;

This could accelerate the development and deployment of specialized models for various tasks and domains.

#### <mark style="color:blue;">**Handling Large Datasets**</mark>

Fine-tuning with LoRA on large datasets may still require significant computational resources, even though the number of trainable parameters is reduced. Strategies such as data parallelism, model parallelism, or gradient accumulation can be employed to handle large datasets efficiently.

[^1]: Subspace refers to a smaller dimensional space within the larger parameter space of the model. Essentially, it means that the model's parameters can be <mark style="color:green;">**represented or projected onto a simpler and smaller set of dimensions**</mark>, allowing the model to perform effectively without needing the full complexity of the original parameter space.

[^2]: Gradient descent is an optimisation algorithm used to minimise the loss function by iteratively moving towards the direction of steepest descent as defined by the negative of the gradient. At each step, the model parameters are adjusted in the opposite direction of the gradient, scaled by a learning rate. This process continues until the model converges to a local minimum of the loss function. It's a fundamental technique used in training machine learning models, including neural networks.

[^3]: **WikiSQL** is a benchmark dataset for natural language to SQL translation. The task involves converting natural language queries into SQL queries that can be executed on a relational database.  <mark style="color:green;">**Example task:**</mark> Given the question "What is the population of France?" and a table of countries with their populations, the model needs to generate the correct SQL query to retrieve the population of France.

[^4]: **MultiNLI** is a dataset used for the natural language inference task. The goal is to determine the relationship between a pair of sentences: whether the second sentence (hypothesis) is entailed by, contradicts, or is neutral with respect to the first sentence (premise).

[^5]: The **Frobenius norm** of a matrix is a measure of its magnitude - calculated as the square root of the sum of the absolute squares of its elements.   This norm gives a sense of the overall size of the entries in the matrix and is used in optimisation and machine learning to compare magnitudes of different matrices.
